Abstract

Penalized regression methods, such as $L_1$ regularization, are routinely used in high-dimensional
applications, and there is a rich literature on optimality properties under sparsity assumptions. In 
the Bayesian paradigm, sparsity is routinely induced through two-component mixture priors having 
a probability mass at zero, but such priors encounter daunting computational problems in high 
dimensions. This has motivated an amazing variety of continuous shrinkage priors, which can be 
expressed as global-local scale mixtures of Gaussians, facilitating computation. In sharp contrast 
to the corresponding frequentist literature, very little is known about the properties of such priors. 
Focusing on a broad class of shrinkage priors, we provide precise results on prior and posterior 
concentration. Interestingly, we demonstrate that most commonly used shrinkage priors, including 
the Bayesian Lasso, are suboptimal in high-dimensional settings. A new class of Dirichlet Laplace
(DL) priors is proposed; these priors are optimal and lead to efficient posterior computation exploiting
results from normalized random measure theory. Finite sample performance of Dirichlet Laplace
priors relative to alternatives is assessed in simulations.

Keywords: Bayesian; Convergence rate; High dimensional; Lasso; $L_1$; Penalized regression;
Regularization; Shrinkage prior. 
1. INTRODUCTION

High-dimensional data have become commonplace in broad application areas, and there is an ex- 
ponentially increasing literature on statistical and computational methods for big data. In such 
settings, it is well known that classical methods such as maximum likelihood estimation break 
down, motivating a rich variety of alternatives based on penalization and thresholding. Most pe- 
nalization approaches produce a point estimate of a high-dimensional coefficient vector, which 
has a Bayesian interpretation as corresponding to the mode of a posterior distribution obtained 
under a shrinkage prior. For example, the wildly popular Lasso/$L_1$ regularization approach to
regression [28] is equivalent to maximum a posteriori (MAP) estimation under a Gaussian linear
regression model having a double exponential (Laplace) prior on the coefficients. There is
a rich theoretical literature justifying the optimality properties of such penalization approaches
[..., 34], with fast algorithms [9] and compelling applied results leading to routine
use of $L_1$ regularization in particular.

The overwhelming emphasis in this literature has been on rapidly producing a point estimate 
with good empirical and theoretical properties. However, in many applications, it is crucial to 
be able to obtain a realistic characterization of uncertainty in the parameters, in functionals of 
the parameters and in predictions. Usual frequentist approaches to characterize uncertainty, such 
as constructing asymptotic confidence regions or using the bootstrap, can break down in high- 
dimensional settings. For example, in regression when the number of subjects n is much less than 
the number of predictors p, one cannot naively appeal to asymptotic normality and resampling 
from the data may not provide an adequate characterization of uncertainty. 

Given that most shrinkage estimators correspond to the mode of a Bayesian posterior, it is nat- 
ural to ask whether we can use the whole posterior distribution to provide a probabilistic measure 
of uncertainty. Several important questions then arise. Firstly, from a frequentist perspective, we 
would like to be able to choose a default shrinkage prior that leads to similar optimality
properties to those shown for $L_1$ penalization and other approaches. However, instead of showing that a
particular penalty leads to a point estimator having a minimax optimal rate of convergence under
sparsity assumptions, we would like to obtain a (much stronger) result that the entire posterior
distribution concentrates at the optimal rate, i.e., the posterior probability assigned to a shrinking 
neighborhood (proportionally to the optimal rate) of the true value of the parameter converges to 
one. In addition to providing a characterization of uncertainty, taking a Bayesian perspective has 
distinct advantages in terms of tuning parameter choice, allowing key penalty parameters to be 
marginalized over the posterior distribution instead of relying on cross-validation. Also, by induc- 
ing penalties through shrinkage priors, important new classes of penalties can be discovered that 
may outperform usual $L_q$-type choices.

An amazing variety of shrinkage priors have been proposed in the Bayesian literature, with 
essentially no theoretical justification for the performance of these priors in the high-dimensional 
settings for which they were designed. [Illll and H provided conditions on the prior for asymp- 
totic normality of linear regression coefficients allowing the number of predictors p to increase 
with sample size n, with [uJ] requiring a very slow rate of growth and [sj] assuming p < n. These 
results required the prior to be sufficiently flat in a neighborhood of the true parameter value, es- 
sentially ruling out shrinkage priors. [|2|] considered shrinkage priors in providing simple sufficient 
conditions for posterior consistency in p < n settings, while [|27|] studied finite sample posterior 
contraction inp ^ n settings. 

In studying posterior contraction in high-dimensional settings, it becomes clear that it is critical 
to obtain tight bounds on prior concentration. This substantial technical hurdle has prevented any 
previous results (to our knowledge) on posterior concentration in $p \gg n$ settings for shrinkage
priors. In fact, prior concentration is critically important not just in studying frequentist optimality 
properties of Bayesian procedures but for Bayesians in obtaining a better understanding of the 
behavior of their priors. Without a precise handle on prior concentration, Bayesians are operating 
in the dark in choosing shrinkage priors and the associated hyperparameters. It becomes an art 
to use intuition and practical experience to indirectly induce a shrinkage prior, while focusing on 
Gaussian scale families for computational tractability. Some beautiful classes of priors have been
proposed by [2, 5, 15] among others, with [23] showing that essentially all existing shrinkage priors
fall within the Gaussian global-local scale mixture family. One of our primary goals is to obtain
theory that can allow evaluation of existing priors and design of novel priors, which are appealing
from a Bayesian perspective in allowing incorporation of prior knowledge and from a frequentist
perspective in leading to minimax optimality under weak sparsity assumptions.

Shrinkage priors provide a continuous alternative to point mass mixture priors, which include 
a mass at zero mixed with a continuous density. These priors are highly appealing in allowing 
separate control of the level of sparsity and the size of the signal coefficients. In a beautiful
recent article, [6] showed optimality properties for carefully chosen point mass mixture priors in
high-dimensional settings. Unfortunately, such priors lead to daunting computational hurdles in
high dimensions due to the need to explore a $2^p$ model space, an NP-hard problem. Continuous
scale mixtures of Gaussian priors can potentially lead to dramatically more efficient posterior
computation. 

Focusing on the normal means problem for simplicity in exposition, we provide general theory 
on prior and posterior concentration under shrinkage priors. One of our main results is that a broad
class of Gaussian scale mixture priors, including the Bayesian Lasso [21] and other commonly
used choices such as ridge regression, is sub-optimal. We provide insight into the reasons for this
sub-optimality and propose a new class of Dirichlet-Laplace (DL) priors, which are optimal and
lead to efficient posterior computation. We show promising initial results for DL and
Dirichlet-Cauchy (DC) priors relative to a variety of competitors.

2. PRELIMINARIES 

In studying prior and posterior computation for shrinkage priors, we require some notation and 
technical concepts. We introduce some of the basic notation here. Technical details in the text are 
kept to a minimum, and proofs are deferred to a later section. 

Given sequences $a_n, b_n$, we write $a_n = O(b_n)$ if there exists a global constant $C$ such that
$a_n \le C b_n$, and $a_n = o(b_n)$ if $a_n/b_n \to 0$ as $n \to \infty$. For a vector $x \in \mathbb{R}^r$, $\|x\|_2$ denotes its
Euclidean norm. We will use $\Delta^{r-1}$ to denote the $(r-1)$-dimensional simplex
$\{x = (x_1, \ldots, x_r)^{\mathrm{T}} : x_j \ge 0, \sum_{j=1}^{r} x_j = 1\}$. Further, let $\Delta_0^{r-1}$ denote
$\{x = (x_1, \ldots, x_{r-1})^{\mathrm{T}} : x_j \ge 0, \sum_{j=1}^{r-1} x_j \le 1\}$.

For a subset $S \subset \{1, \ldots, n\}$, let $|S|$ denote the cardinality of $S$ and define $\theta_S = (\theta_j : j \in S)$
for a vector $\theta \in \mathbb{R}^n$. Denote by $\mathrm{supp}(\theta)$ the support of $\theta$, the subset of $\{1, \ldots, n\}$ corresponding
to the non-zero entries of $\theta$. Let $l_0[q; n]$ denote the subset of $\mathbb{R}^n$ given by

$$l_0[q; n] = \{\theta \in \mathbb{R}^n : \#(1 \le j \le n : \theta_j \ne 0) \le q\}.$$

Clearly, $l_0[q; n]$ consists of $q$-sparse vectors $\theta$ with $|\mathrm{supp}(\theta)| \le q$.

Let $\mathrm{DE}(\tau)$ denote a zero mean double-exponential or Laplace distribution with density
$f(y) = (2\tau)^{-1} e^{-|y|/\tau}$ for $y \in \mathbb{R}$. Also, we use the following parametrization for the three-parameter
generalized inverse Gaussian (giG) distribution: $Y \sim \mathrm{giG}(\lambda, \rho, \chi)$ if $f(y) \propto y^{\lambda-1} e^{-0.5(\rho y + \chi/y)}$ for
$y > 0$.

3. CONCENTRATION PROPERTIES OF GLOBAL-LOCAL PRIORS 
3.1 Motivation 

For a high-dimensional vector $\theta \in \mathbb{R}^n$, a natural way to incorporate sparsity in a Bayesian
framework is to use point mass mixture priors

$$\theta_j \sim (1 - \pi)\delta_0 + \pi g_\theta, \quad j = 1, \ldots, n, \qquad (1)$$

where $\pi = \Pr(\theta_j \ne 0)$, $E\{|\mathrm{supp}(\theta)| \mid \pi\} = n\pi$ is the prior guess on model size (sparsity level),
and $g_\theta$ is an absolutely continuous density on $\mathbb{R}$. It is common to place a beta prior on $\pi$, leading to
a beta-Bernoulli prior on the model size, which conveys an automatic multiplicity adjustment [26].
[6] established that prior (1) with an appropriate beta prior on $\pi$ and suitable tail conditions on $g_\theta$
leads to a frequentist minimax optimal rate of posterior contraction in the normal means setting.
We shall revisit the normal means problem in subsection 3.4.
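
As a rough illustration of prior (1), the following Python sketch draws $\theta$ from a point mass mixture prior. The Beta(1, n + 1) hyperparameters and the DE(1) slab are assumptions chosen for concreteness, and the function name `sample_point_mass_mixture` is hypothetical. Under this choice $E(\pi) = 1/(n + 2)$, so the simulated model sizes should average roughly one non-zero entry regardless of $n$.

```python
import random


def sample_point_mass_mixture(n, rng, b=None):
    """Draw (pi, theta) from prior (1) with pi ~ Beta(1, b) and a DE(1) slab.

    The default b = n + 1 is an illustrative assumption, not a recommendation.
    """
    if b is None:
        b = n + 1
    pi = rng.betavariate(1.0, b)
    theta = []
    for _ in range(n):
        if rng.random() < pi:
            # DE(1) draw: exponential magnitude with a random sign
            magnitude = rng.expovariate(1.0)
            theta.append(magnitude if rng.random() < 0.5 else -magnitude)
        else:
            theta.append(0.0)  # point mass at zero
    return pi, theta


rng = random.Random(0)
n = 1000
sizes = []
for _ in range(200):
    _, theta = sample_point_mass_mixture(n, rng)
    sizes.append(sum(1 for t in theta if t != 0.0))
mean_size = sum(sizes) / len(sizes)
print(f"average model size over 200 prior draws: {mean_size:.2f}")
```

The average model size staying near one even for $n = 1000$ reflects the multiplicity adjustment carried by the beta prior on $\pi$.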

Although point mass mixture priors are intuitively appealing and possess attractive theoretical
properties, posterior sampling requires a stochastic search over an enormous space in complicated
models where marginal likelihoods are not available analytically, leading to slow mixing and
convergence. Computational issues, and the consideration that many of the $\theta_j$s may be small but not
exactly zero, have motivated a rich literature on continuous shrinkage priors; for some flavor of the
vast literature refer to [2, ..., 21]. [23] noted that essentially all such shrinkage priors can be
represented as global-local (GL) mixtures of Gaussians,

$$\theta_j \sim \mathrm{N}(0, \psi_j \tau), \quad \psi_j \sim f, \quad \tau \sim g, \qquad (2)$$

where $\tau$ controls global shrinkage towards the origin while the local scales $\{\psi_j\}$ allow deviations
in the degree of shrinkage. If $g$ puts sufficient mass near zero and $f$ is appropriately chosen, GL
priors in (2) can intuitively approximate (1) but through a continuous density concentrated near
zero with heavy tails.

GL priors potentially have substantial computational advantages over variable selection priors,
since the normal scale mixture representation allows for conjugate updating of $\theta$ and $\psi$ in a block.
Moreover, a number of frequentist regularization procedures such as ridge, lasso, bridge and elastic
net correspond to posterior modes under GL priors with appropriate choices of $f$ and $g$. For
example, one obtains a double-exponential prior corresponding to the popular $L_1$ or lasso penalty
if $f$ has an exponential distribution. However, unlike variable selection priors (1), many aspects of
shrinkage priors are poorly understood. For example, even basic properties, such as how the prior
concentrates around an arbitrary sparse $\theta_0$, remain to be shown. Hence, Bayesians tend to operate
in the dark in using such priors, and frequentists tend to be skeptical due to the lack of theoretical
justification.
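
The correspondence between an exponential mixing density and the double-exponential prior can be checked by simulation. The sketch below assumes that Exp(1/2) denotes an exponential distribution with rate 1/2 (mean 2), under which $\theta \mid \psi \sim \mathrm{N}(0, \psi\tau)$ has a $\mathrm{DE}(\sqrt{\tau})$ marginal, so that $E|\theta| = \sqrt{\tau}$ and $E\theta^2 = 2\tau$. The function name and the value $\tau = 4$ are illustrative choices.

```python
import math
import random


def sample_gl_laplace(num_draws, tau, rng):
    """Draw theta | psi ~ N(0, psi * tau) with psi ~ Exp(rate 1/2)."""
    draws = []
    for _ in range(num_draws):
        psi = rng.expovariate(0.5)  # local scale with mean 2
        draws.append(rng.gauss(0.0, math.sqrt(psi * tau)))
    return draws


rng = random.Random(1)
tau = 4.0
draws = sample_gl_laplace(100_000, tau, rng)
mean_abs = sum(abs(t) for t in draws) / len(draws)
second_moment = sum(t * t for t in draws) / len(draws)
print(f"Monte Carlo E|theta|  = {mean_abs:.3f}, DE(sqrt(tau)) value = {math.sqrt(tau):.3f}")
print(f"Monte Carlo E theta^2 = {second_moment:.3f}, DE(sqrt(tau)) value = {2 * tau:.3f}")
```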

This skepticism is somewhat warranted, as it is clearly the case that reasonable seeming priors
can have poor performance in high-dimensional settings. For example, choosing $\pi = 1/2$ in prior
(1) leads to an exponentially small prior probability of $2^{-n}$ assigned to the null model, so that it
becomes literally impossible to override that prior informativeness with the information in the data
to pick the null model. However, with a beta prior on $\pi$, this problem can be avoided [26]. In the
same vein, if one places i.i.d. $\mathrm{N}(0, 1)$ priors on the entries of $\theta$, then the induced prior on $\|\theta\|$ is
highly concentrated around $\sqrt{n}$, leading to misleading inferences on $\theta$ almost everywhere. These
are simple cases, but it is of key importance to assess whether such problems arise for other priors
in the GL family and if so, whether improved classes of priors can be found.
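
The concentration of $\|\theta\|$ around $\sqrt{n}$ under i.i.d. N(0, 1) priors is easy to see numerically. The following sketch (the dimension n = 2500 and the 200 replicates are arbitrary illustrative choices) reports the range of $\|\theta\|_2/\sqrt{n}$ across prior draws; the ratios stay close to one, so such a prior places essentially no mass near the null model or near sparse vectors of small norm.

```python
import math
import random


def l2_norm_of_standard_normal(n, rng):
    """Euclidean norm of one draw of n i.i.d. N(0, 1) coordinates."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)))


rng = random.Random(2)
n = 2500
ratios = [l2_norm_of_standard_normal(n, rng) / math.sqrt(n) for _ in range(200)]
print(f"min ratio = {min(ratios):.3f}, max ratio = {max(ratios):.3f}")
```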

There has been a recent awareness of these issues, motivating a basic assessment of the marginal
properties of shrinkage priors for a single $\theta_j$. Recent priors such as the horseshoe [5] and
generalized double Pareto [2] are carefully formulated to obtain marginals having a high concentration
around zero with heavy tails. This is well justified, but as we will see below, such marginal
behavior alone is not sufficient; it is necessary to study the joint distribution of $\theta$ on $\mathbb{R}^n$. Specifically,
we recommend studying the prior concentration $P(\|\theta - \theta_0\| < t_n)$ where the true parameter $\theta_0$ is
assumed to be sparse: $\theta_0 \in l_0[q_n; n]$ with the number of non-zero components $q_n \ll n$ and

$$t_n = n^{\delta/2} \quad \text{with } \delta \in (0, 1). \qquad (3)$$

In models where $q_n \ll n$, the prior must place sufficient mass around sparse vectors to allow
for good posterior contraction; see subsection 3.4 for further details. Now, as a first illustration,
consider the following two extreme scenarios: i.i.d. standard normal priors for the individual
components $\theta_j$ vs. point mass mixture priors given by (1).

Theorem 3.1. Assume that $\theta_0 \in l_0[q_n; n]$ with $q_n = o(n)$. Then, for i.i.d. standard normal priors
on $\theta_j$,

$$P(\|\theta - \theta_0\|_2 < t_n) \le e^{-c n}. \qquad (4)$$

For point mass mixture priors (1) with $\pi \sim \mathrm{Beta}(1, n + 1)$ and $g_\theta$ being a standard Laplace
distribution $g_\theta = \mathrm{DE}(1)$,

$$P(\|\theta - \theta_0\|_2 < t_n) \ge e^{-c \max\{q_n, \|\theta_0\|_1\}}. \qquad (5)$$

Proof. Using $\|\theta\|_2^2 \sim \chi_n^2$, the claim made in (4) follows from an application of Anderson's
inequality (6.1) and standard chi-square deviation inequalities. In particular, the exponentially small
concentration also holds for $P(\|\theta\|_2 < t_n)$. The second claim (5) follows from results in [6]. □

As seen from Theorem 3.1, the point mass mixture priors have much improved concentration
around sparse vectors, as compared to the i.i.d. normal prior distributions. The theoretical
properties enjoyed by the point mass mixture priors can mostly be attributed to this improved
concentration. The above comparison suggests that it is of merit to evaluate a shrinkage prior in
high-dimensional models under sparsity assumptions by obtaining its concentration rates around sparse
vectors. In this paper, we carry out this program for a wide class of shrinkage priors. Our analysis
also suggests some novel priors with improved concentration around sparse vectors.

In order to communicate our main results to a wide audience, we will first present specific
corollaries of our main results applied to various existing shrinkage priors. The main results are
given in Section 6. Recall the GL priors presented in (2) and the sequence $t_n$ in (3).

3.2 Prior concentration for global priors 

This simplified setting involves only a global parameter, i.e., $\psi_j = 1$ for all $j$. This subclass
includes the important example of ridge regression, with $\tau$ routinely assigned an inverse-gamma
prior, $\tau \sim \mathrm{IG}(\alpha, \beta)$.

Theorem 3.2. Assume $\theta \sim$ GL with $\psi_j = 1$ for all $j$. If the prior $f$ on the global parameter $\tau$ has
an $\mathrm{IG}(\alpha, \beta)$ distribution, then

$$P(\|\theta\|_2 < t_n) \le e^{-C n^{1-\delta}}, \qquad (6)$$

where $C > 0$ is a constant depending only on $\alpha$ and $\beta$.

The above theorem shows that compared to i.i.d. normal priors (4), the prior concentration
does not improve much under an inverse-gamma prior on the global variance regardless of the
hyperparameters (provided they don't scale with $n$) even when $\theta_0 = 0$. Concentration around $\theta_0$ away
from zero will clearly be even worse. Hence, such a prior is not well-suited in high-dimensional
settings, confirming empirical observations documented in [10, 24]. It is also immediate that the
same concentration bound in (6) would be obtained for the giG family of priors on $\tau$.

In [24], the authors instead recommended a half-Cauchy prior as a default choice for the global
variance (also see [10]). We consider the following general class of densities on $(0, \infty)$ for $\tau$, to
be denoted $\mathcal{F}$ henceforth, that satisfy: (i) $f(\tau) \le M$ for all $\tau \in (0, \infty)$, and (ii) $f(\tau) > 1/M$ for
all $\tau \in (0, 1)$, for some constant $M > 0$. Clearly, $\mathcal{F}$ contains the half-Cauchy and exponential
families. The following result provides concentration bounds for these priors.

Theorem 3.3. Let $\|\theta_0\|_2 = o(\sqrt{n})$. If the prior $f$ on the global parameter $\tau$ belongs to the class
$\mathcal{F}$ above then,

Furthermore, if $\|\theta_0\|_2 > t_n$, then

where $a_n = \|\theta_0\|_2 / t_n > 1$ and $c_i, C_i > 0$ are constants with $C_1, C_2, c_2$ depending only on $M$ in
the definition of $\mathcal{F}$ and $c_1$ depending on $M$ and $\delta$.

Thus (7) in Theorem 3.3 shows that the prior concentration around zero can be dramatically
improved from exponential to polynomial with a careful prior on $\tau$ that can assign sufficient mass
near zero, such as the half-Cauchy prior [10, 24]. Unfortunately, as (8) shows, for signals of large
magnitude one again obtains an exponentially decaying probability. Hence, Theorem 3.3
conclusively shows that global shrinkage priors are simply not flexible enough for high-dimensional
problems.

Remark 3.4. The condition $\|\theta_0\|_2 > t_n$ is only used to prove the lower bound in (8). For any $\|\theta_0\|$
bounded below by a constant, we would still obtain an upper bound e"*"*^^ * in ([8]), similar to 
the bound in 

3.3 Prior concentration for a class of GL priors

Proving concentration results for the GL family (2) in the general setting presents a much harder
challenge compared to Theorem 3.3, since we now have to additionally integrate over the $n$ local
parameters $\psi = (\psi_1, \ldots, \psi_n)$. We focus on an important sub-class in Theorem 6.4 below, namely
the exponential family for the distribution of $g$ in (2). For analytical tractability, we additionally
assume that $\theta_0$ has only one non-zero entry. The interest in the exponential family arises from
the fact that normal-exponential scale mixtures give rise to the double-exponential family [32]:
$\theta \mid \psi \sim \mathrm{N}(0, \psi\sigma^2)$, $\psi \sim \mathrm{Exp}(1/2)$ implies $\theta \sim \mathrm{DE}(\sigma)$, and hence this family of priors can be
considered as a Bayesian version of the lasso [21]. We now state a concentration result for this
class, noting that a general version of Theorem 3.5 can be found in Theorem 6.4 stated in Section
6.

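Theorem 3.5. Assume $\theta \sim$ GL with $f \in \mathcal{F}$ and $g = \mathrm{Exp}(\lambda)$ for some constant $\lambda > 0$. Also
assume $\theta_0$ has only one non-zero entry and $\|\theta_0\|_2^2 > \log n$. Then, for a global constant $C > 0$
depending only on $M$ in the definition of $\mathcal{F}$,

$$P(\|\theta - \theta_0\|_2 < t_n) \le e^{-C\sqrt{n}}. \qquad (9)$$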



Theorem 3.5 asserts that even in the simplest deviation from the null model with only one
signal, one continues to have exponentially small concentration under an exponential prior on the
local scales. From (5) in Theorem 3.1, appropriate point mass mixture priors (1) would have
$P(\|\theta - \theta_0\|_2 < t_n) \ge e^{-c\sqrt{\log n}}$ under the same conditions as above, clearly showing that the wide
difference in concentration still persists.

3.4 Posterior lower bounds in normal means 

We have discussed the prior concentration for a high-dimensional vector $\theta$ without alluding to any
specific model so far. In this section we show how prior concentration impacts posterior inference
for the widely studied normal means problem$^1$ (see [..., 15] and references therein):

$$y_i = \theta_i + \epsilon_i, \quad \epsilon_i \sim \mathrm{N}(0, 1), \quad 1 \le i \le n. \qquad (10)$$

The minimax rate $s_n$ for the above model is given by $s_n^2 = q_n \log(n/q_n)$ when $\theta_0 \in l_0[q_n; n]$.
For this model [6] recently established that for point mass priors for $\theta$ with $\pi \sim \mathrm{beta}(1, \kappa n + 1)$
and $g_\theta$ having Laplace like or heavier tails, the posterior contracts at the minimax rate, i.e.,
$E_{n,\theta_0} P(\|\theta - \theta_0\|_2 < M s_n \mid y) \to 1$ for some constant $M > 0$. Thus we see that carefully chosen
point mass priors are indeed optimal$^2$. However, not all choices for $g_\theta$ lead to optimal procedures;
[6] also showed that if $g_\theta$ is instead chosen to be standard Gaussian, the posterior does not contract
at the minimax rate, i.e., one could have $E_{n,\theta_0} P(\|\theta - \theta_0\|_2 < s_n \mid y) \to 0$ for signals of sufficiently
large magnitude. This result is particularly striking given the routine choice of Gaussian for $g_\theta$ in
Bayesian variable selection and thus clearly illustrates the need for careful prior choice in high
dimensions.
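
For a sense of scale, the quantity $s_n^2 = q_n \log(n/q_n)$ can be evaluated directly and set against $\sqrt{n}$, the typical size of $\|\theta\|_2$ under an i.i.d. N(0, 1) prior. The values of $n$ and $q_n$ below are arbitrary illustrative choices, and the function name is hypothetical.

```python
import math


def minimax_squared_rate(n, q):
    """Return s_n^2 = q_n * log(n / q_n) for 1 <= q_n <= n."""
    if not 1 <= q <= n:
        raise ValueError("require 1 <= q <= n")
    return q * math.log(n / q)


for n, q in [(10**3, 5), (10**5, 10), (10**6, 10)]:
    s2 = minimax_squared_rate(n, q)
    print(
        f"n={n:>8d}  q_n={q:>3d}  s_n^2={s2:9.2f}  "
        f"s_n={math.sqrt(s2):6.2f}  sqrt(n)={math.sqrt(n):8.2f}"
    )
```

In sparse regimes $s_n$ is far smaller than $\sqrt{n}$, which is why a prior whose mass sits at radius $\sqrt{n}$ cannot contract at the minimax rate.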

To establish such a posterior lower-bound result, [6] showed that given a fixed sequence $t_n$, if
there exists a sequence $r_n$ ($r_n > t_n$) such that

$$\frac{P(\|\theta - \theta_0\|_2 < t_n)}{P(\|\theta - \theta_0\|_2 < r_n)} = o(e^{-r_n^2}), \qquad (11)$$

then $P(\|\theta - \theta_0\|_2 < t_n \mid y) \to 0$. This immediately shows the importance of studying the
prior concentration. Intuitively, (11) would be satisfied when the prior mass of the bigger ball
$\|\theta - \theta_0\|_2 < r_n$ is almost entirely contained in the annulus with inner radius $t_n$ and outer radius $r_n$,
so that the smaller ball $\|\theta - \theta_0\|_2 < t_n$ barely has any prior mass compared to the bigger ball. As
an illustrative example, in the i.i.d. $\mathrm{N}(0, 1)$ example with $t_n = s_n$, setting $r_n = \sqrt{n}$ would satisfy
(11) above, proving that i.i.d. $\mathrm{N}(0, 1)$ priors are sub-optimal. Our goal is to investigate whether a
similar phenomenon persists for global-local priors in light of the concentration bounds developed
in Theorems 3.3 and 6.4.

$^1$ Although we study the normal means problem, the ideas and results in this section are applicable to other models
such as non-parametric regression and factor models.

$^2$ It is important that the hyper parameter for $\pi$ depends on $n$. We do not know if the result holds without this

As in Section 3.2, we first state our posterior lower bound result for the case where there is 
only a global parameter. 

Theorem 3.6. Suppose we observe $y \sim \mathrm{N}_n(\theta_0, I_n)$ and (10) is fitted with a GL prior on $\theta$ such
that $\psi_j = 1$ for all $j$ and the prior $f$ on the global parameter $\tau$ lies in $\mathcal{F}$. Assume $\theta_0 \in l_0[q_n; n]$
where $q_n/n \to 0$ and $\|\theta_0\|_2 > s_n$, with $s_n^2 = q_n \log(n/q_n)$ being the minimax squared error loss
over $l_0[q_n; n]$. Then, $E_{n,\theta_0} P(\|\theta - \theta_0\|_2 < s_n \mid y) \to 0$.

Proof. Without loss of generality, assume $\|\theta_0\|_2 = o(\sqrt{n})$, since the posterior mass with a prior
centered at the origin would be smaller otherwise. Choosing $t_n = s_n$ and $r_n$ to be a sequence such that
$t_n < r_n < \|\theta_0\|_2$, and resorting to the two-sided bounds in Theorem 3.3, the ratio in (11) is smaller
than $(t_n/r_n)^n$, and hence $e^{r_n^2}(t_n/r_n)^n \to 0$ since $r_n \le \|\theta_0\|_2 = o(\sqrt{n})$. □

Theorem 3.6 states that a GL prior with only a global scale is sub-optimal if $\|\theta_0\|_2 > s_n$.
Observe that in the complementary region $\{\|\theta_0\|_2 \le s_n\}$, the estimator $\hat{\theta} = 0$ attains squared error
in the order of $q_n \log(n/q_n)$, implying the condition $\|\theta_0\|_2 > s_n$ is hardly stringent.

Next, we state a result for the sub-class of GL priors as in Theorem 6.4, i.e., when $g$ has an
exponential distribution leading to a double-exponential distribution marginally.

Theorem 3.7. Suppose we observe $y \sim \mathrm{N}_n(\theta_0, I_n)$ and the model in (10) is fitted with a GL prior
on $\theta$ such that $f$ lies in $\mathcal{F}$ and $g = \mathrm{Exp}(\lambda)$ for some constant $\lambda > 0$. Assume $\theta_0 \in l_0[q_n; n]$ with
$q_n = 1$ and $\|\theta_0\|_2^2 / \log n \to \infty$. Then, $E_{n,\theta_0} P(\|\theta - \theta_0\|_2 \le \sqrt{\log n} \mid y) \to 0$.

A proof of Theorem 3.7 is deferred to Section 6. From [6], appropriate point mass mixture priors
would assign increasing mass with $n$ to the same neighborhood in Theorem 3.7. Hence, many
of the shrinkage priors used in practice are sub-optimal in high-dimensional applications, even in
the simplest deviation from the null model with only one moderately sized signal. Although
Theorem 3.7 is stated and proved for $g$ having an exponential distribution (which includes the Bayesian
lasso [21]), we conjecture that the conclusions would continue to be valid if one only assumes $g$ to
have exponential tails plus some mild conditions on the behavior near zero. However, the
assumptions of Theorem 3.7 preclude the case when $g$ has polynomial tails, such as the horseshoe [5] and
generalized double Pareto [2]. One no longer obtains tight bounds on the prior concentration for $g$
having polynomial tails using the current techniques and it becomes substantially complicated to
study the posterior.

Another important question beyond the scope of the current paper concerns the behavior
of the posterior when one plugs in an empirical Bayes estimator of the global parameter $\tau$.
However, we show below that the "optimal" sample-size dependent plug-in choice $\tau_n = c^2/\log n$ (so
that marginally $\theta_j \sim \mathrm{DE}(c/\sqrt{\log n})$) for the lasso estimator [20] produces a sub-optimal posterior:

Theorem 3.8. Suppose we observe $y \sim \mathrm{N}_n(\theta_0, I_n)$ and (10) is fitted with a GL prior on $\theta$ such that
$\tau$ is deterministically chosen to be $\tau_n$, i.e., $f = \delta_{\tau_n}$ for a non-random sequence $\tau_n$ and $g = \mathrm{Exp}(\lambda)$
for some constant $\lambda > 0$. Assume $\theta_0 \in l_0[q_n; n]$ with $q_n (\log n)^2 = o(n)$ and $\tau_n = c/\log n$ is used
as the plug-in choice. Then, $E_{n,\theta_0} P(\|\theta - \theta_0\|_2 \le s_n \mid y) \to 0$, with $s_n^2 = q_n \log(n/q_n)$ being the
minimax squared error loss over $l_0[q_n; n]$.

A proof of Theorem 3.8 can be found in Section 6. Note that a slightly stronger assumption on
the sparsity allows us to completely obviate any condition on $\theta_0$ in this case. Also, the result can
be generalized to any $\tau_n$ if $q_n \log n/\tau_n = o(n)$.

4. A NEW CLASS OF SHRINKAGE PRIORS 
The results in Section 3 necessitate the development of a general class of continuous shrinkage
priors with improved concentration around sparse vectors. To that end, let us revisit the global-local
specification (2). After integrating out the local scales $\psi_j$'s, (2) can be equivalently represented as
a global scale mixture of a kernel $\mathcal{K}(\cdot)$,

$$\theta_j \sim \mathcal{K}(\cdot, \tau), \quad \tau \sim g, \qquad (12)$$

where $\mathcal{K}(x) = \int \psi^{-1/2} \phi(x/\sqrt{\psi}) g(\psi) \, d\psi$ is a symmetric unimodal density (or kernel) on $\mathbb{R}$ and
$\mathcal{K}(x, \tau) = \tau^{-1/2} \mathcal{K}(x/\sqrt{\tau})$. For example, $\psi_j \sim \mathrm{Exp}(1/2)$ corresponds to a double exponential
kernel $\mathcal{K} = \mathrm{DE}(1)$, while $\psi_j \sim \mathrm{IG}(1/2, 1/2)$ results in a standard Cauchy kernel $\mathcal{K} = \mathrm{Ca}(0, 1)$. These
traditional choices lead to a kernel which is bounded in a neighborhood of zero, and the resulting
global-local procedure (12) with a single global parameter $\tau$ doesn't attain the desired concentration
around sparse vectors as documented in Theorem 3.5, leading to sub-optimal behavior of the
posterior in Theorem 3.7.
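
The claim that an Exp(1/2) mixing density yields the DE(1) kernel can be verified numerically. The sketch below is only an illustration: it assumes Exp(1/2) has rate 1/2, and it uses the substitution $\psi = u^2$ (a choice made here, not taken from the paper) to remove the $\psi^{-1/2}$ singularity, which turns the mixture integral into $(2\pi)^{-1/2} \int_0^\infty \exp\{-x^2/(2u^2) - u^2/2\} \, du$. A midpoint rule on a truncated range then approximates $\mathcal{K}(x)$.

```python
import math


def laplace_kernel_via_mixture(x, upper=12.0, steps=20_000):
    """Approximate K(x) for psi ~ Exp(rate 1/2) with the midpoint rule in u = sqrt(psi)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h  # midpoint of the i-th cell, always positive
        # Integrand after psi = u^2, up to the constant (2*pi)^(-1/2)
        total += math.exp(-x * x / (2.0 * u * u) - u * u / 2.0)
    return total * h / math.sqrt(2.0 * math.pi)


def de1_density(x):
    """Density of DE(1), i.e. exp(-|x|) / 2."""
    return 0.5 * math.exp(-abs(x))


for x in (0.0, 0.5, 1.0, 3.0):
    print(f"x={x:3.1f}  mixture={laplace_kernel_via_mixture(x):.6f}  DE(1)={de1_density(x):.6f}")
```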

However, if one instead uses a half Cauchy prior $\psi_j^{1/2} \sim \mathrm{Ca}_+(0, 1)$, then the resulting
horseshoe kernel [4, 5] is unbounded with a singularity at zero. This phenomenon coupled with tail
robustness properties leads to excellent empirical performance of the horseshoe. However, the
joint distribution of $\theta$ under a horseshoe prior is understudied. One can imagine that it achieves
a higher prior concentration around sparse vectors compared to common shrinkage priors since
the singularity at zero potentially allows most of the entries to be concentrated around zero with
the heavy tails ensuring concentration around the relatively small number of signals. However,
the polynomial tails of $\psi_j$ present a hindrance in obtaining tight bounds using our techniques. We
hope to address the polynomial tails case in detail elsewhere, though based on strong empirical
performance, we conjecture that the horseshoe leads to the optimal posterior contraction in a much
broader domain compared to the Bayesian lasso and other common shrinkage priors. The
normal-gamma scale mixtures [15] and the generalized double Pareto prior [2] follow the same philosophy
and should have similar properties.

The above class of priors relies on obtaining a suitable kernel $\mathcal{K}$ through appropriate normal scale
mixtures. In this article, we offer a fundamentally different class of shrinkage priors that alleviate
the requirements on the kernel, while having attractive theoretical properties. In particular, our
proposed class of kernel-Dirichlet (kD) priors replaces the single global scale $\tau$ in (12) by a vector
of scales $(\phi_1\tau, \ldots, \phi_n\tau)$, where $\phi = (\phi_1, \ldots, \phi_n)$ is constrained to lie in the $(n - 1)$-dimensional
simplex $\mathcal{S}^{n-1}$:

$$\theta_j \mid \phi_j, \tau \sim \mathcal{K}(\cdot, \phi_j\tau), \quad (\phi, \tau) \in \mathcal{S}^{n-1} \otimes \mathbb{R}^+, \qquad (13)$$

where $\mathcal{K}$ is any symmetric (about zero) unimodal density that can be represented as a scale mixture
of normals [32]. While previous shrinkage priors in the literature obtain marginal behavior similar
to the point mass mixture priors (1), our construction aims at resembling the joint distribution of $\theta$
under a two-component mixture prior. Constraining $\phi$ on $\mathcal{S}^{n-1}$ restrains the "degrees of freedom"
of the $\phi_j$'s, offering better control on the number of dominant entries in $\theta$. In particular, letting
$\phi \sim \mathrm{Dir}(a, \ldots, a)$ for a suitably chosen $a$ allows (13) to behave like (1) jointly, forcing a large
subset of $(\theta_1, \ldots, \theta_n)$ to be simultaneously close to zero with high probability.

We focus on the Laplace kernel from now on for concreteness, noting that all the results stated
below can be generalized to other choices. The corresponding hierarchical prior

$$\theta_j \sim \mathrm{DE}(\phi_j\tau), \quad \phi \sim \mathrm{Dir}(a, \ldots, a), \quad \tau \sim g \qquad (14)$$

is referred to as a Dirichlet Laplace prior, denoted $\mathrm{DL}_a(\tau)$. In the following Theorem 4.1, we
establish the improved prior concentration of the DL prior. For the sake of comparison with the
global-local priors in Section 3.3, we assume the same conditions as in Theorem 3.5; a general
version can be found in Section 6.

Theorem 4.1. Assume $\theta \sim \mathrm{DL}_a(\tau)$ as in (14) with $a = 1/n$ and $\tau \sim \mathrm{Exp}(\lambda)$ for some $\lambda > 0$.
Also assume $\theta_0$ has only one non-zero entry and $\|\theta_0\|_2^2 = c \log n$. Also, recall the sequence $t_n$ in
(3). Then, for a constant $C$ depending only on $\delta$ and $\lambda$,

$$P(\|\theta - \theta_0\| < t_n) \ge \exp\{-C\sqrt{\log n}\}. \qquad (15)$$

From (5) in Theorem 3.1, appropriate point mass mixtures would attain exactly the same
concentration as in (15), showing the huge improvement in concentration compared to global-local
priors. This further establishes the role of the dependent scales $\phi$, since in the absence of $\phi$, a $\mathrm{DE}(\tau)$
prior with $\tau \sim \mathrm{Exp}(\lambda)$ would lead to a concentration smaller than $e^{-C\sqrt{n}}$ (see Theorem 3.5).
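
To build intuition for how the Dirichlet scales restrict the number of dominant coordinates, the following sketch draws from the DL prior (14) with $a = 1/n$. It is a minimal illustration under stated assumptions: Exp($\lambda$) is read as an exponential distribution with rate $\lambda$, the value $\lambda = 1/2$ and the 0.01 threshold are arbitrary, the Dirichlet draw uses normalized gamma variables, and all function names are hypothetical.

```python
import random


def sample_dirichlet(n, a, rng):
    """Symmetric Dirichlet(a, ..., a) draw via normalized gamma variables."""
    gammas = [rng.gammavariate(a, 1.0) for _ in range(n)]
    total = sum(gammas)
    return [g / total for g in gammas]


def sample_laplace(scale, rng):
    """DE(scale) draw: exponential magnitude with a random sign."""
    magnitude = scale * rng.expovariate(1.0)
    return magnitude if rng.random() < 0.5 else -magnitude


def sample_dl_prior(n, a, lam, rng):
    """Draw (theta, phi, tau) with theta_j ~ DE(phi_j * tau) and tau ~ Exp(rate lam)."""
    phi = sample_dirichlet(n, a, rng)
    tau = rng.expovariate(lam)
    theta = [sample_laplace(p * tau, rng) for p in phi]
    return theta, phi, tau


rng = random.Random(3)
n = 200
dominant_counts = []
for _ in range(100):
    theta, phi, tau = sample_dl_prior(n, a=1.0 / n, lam=0.5, rng=rng)
    dominant_counts.append(sum(1 for p in phi if p > 0.01))
average_dominant = sum(dominant_counts) / len(dominant_counts)
print(f"average number of phi_j above 0.01 (out of {n}): {average_dominant:.2f}")
```

Only a handful of the $\phi_j$ carry non-negligible weight in a typical draw, so most coordinates of $\theta$ are shrunk very close to zero jointly rather than one at a time.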

To further understand the role of $\phi$, we undertake a study of the marginal properties of $\theta_j$
integrating out $\phi_j$. Clearly, the marginal distribution of $\phi_j$ is $\mathrm{Beta}(a, (n - 1)a)$. Let $\mathrm{WG}(\alpha, \beta)$
denote a wrapped gamma distribution with density function

$$f(x; \alpha, \beta) \propto |x|^{\alpha - 1} e^{-\beta |x|}, \quad x \in \mathbb{R}.$$

The results are summarized in Proposition 4.2 below.

Proposition 4.2. If $\theta \mid \phi, \tau \sim \mathrm{DL}_a(\tau)$ and $\phi \sim \mathrm{Dir}(a, \ldots, a)$, then the marginal distribution of
$\theta_j$ given $\tau$ is unbounded with a singularity at zero for any $a < 1$. Further, in the special case
$a = 1/n$, the marginal distribution is a wrapped gamma distribution $\mathrm{WG}(1/n, \tau^{-1})$.

Thus, marginalizing over $\phi$, we obtain an unbounded kernel $\mathcal{K}$ (similar to the horseshoe). Since
the marginal density of $\theta_j \mid \tau$ has a singularity at 0, it assigns a huge mass near zero while retaining
exponential tails, which partly explains the improved concentration. A proof of Proposition 4.2
can be found in the appendix.
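The special case $a = 1/n$ of Proposition 4.2 can be checked numerically. The sketch below is illustrative only: it assumes the double-exponential density $\mathrm{DE}(b)$ is $e^{-|x|/b}/(2b)$ and that the normalised wrapped gamma density is $\beta^{\alpha}|x|^{\alpha-1}e^{-\beta|x|}/\{2\Gamma(\alpha)\}$; the function names are hypothetical.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma
from scipy.stats import beta


def dl_marginal_density(theta, tau, n):
    """Integrate the DE(phi * tau) density against phi ~ Beta(1/n, 1 - 1/n)."""
    a = 1.0 / n

    def integrand(phi):
        scale = phi * tau
        return np.exp(-abs(theta) / scale) / (2.0 * scale) * beta.pdf(phi, a, 1.0 - a)

    value, _ = quad(integrand, 0.0, 1.0, limit=200)
    return value


def wrapped_gamma_density(theta, alpha, rate):
    """Normalised WG(alpha, rate) density evaluated at theta."""
    return (rate**alpha / (2.0 * gamma(alpha))
            * abs(theta) ** (alpha - 1.0) * np.exp(-rate * abs(theta)))


if __name__ == "__main__":
    n, tau = 10, 2.0
    for theta in (0.1, 1.0, 3.0):
        print(theta, dl_marginal_density(theta, tau, n),
              wrapped_gamma_density(theta, 1.0 / n, 1.0 / tau))
```

The two columns should agree up to quadrature error, which is one way to see the $|\theta_j|^{1/n - 1}$ singularity at zero together with the exponential tail.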

There is a recent frequentist literature on including a local penalty specific to each coefficient. 



The adaptive Lasso Oil. 13511 relies on empirically estimated weights that are plugged in. [18] 
instead propose to sample the penalty parameters from a posterior, with a sparse point estimate 
obtained for each draw. These approaches do not produce a full posterior distribution but focus on 
sparse point estimates. 

4.1 Posterior computation 

The proposed class of DL priors leads to straightforward posterior computation via an efficient
data augmented Gibbs sampler. Note that the $\mathrm{DL}_a(\tau)$ prior (14) can be equivalently represented as

$$\theta_j \sim N(0, \psi_j \phi_j^2 \tau^2), \quad \psi_j \sim \mathrm{Exp}(1/2), \quad \phi \sim \mathrm{Dir}(a, \ldots, a).$$

In the general $\mathrm{DL}_a(\tau)$ setting, we assume a gamma$(\lambda, 1/2)$ prior on $\tau$ with $\lambda = na$. In the special
case when $a = 1/n$, the prior on $\tau$ reduces to an Exp$(1/2)$ prior consistent with the statement of
Theorem 4.1.

We detail the steps in the normal means setting but the algorithm is trivially modified to
accommodate normal linear regression, robust regression with heavy tailed residuals, probit models,
logistic regression, factor models and other hierarchical Gaussian cases. To reduce auto-correlation,
we rely on marginalization and blocking as much as possible. Our sampler cycles through (i)
$\theta \mid \psi, \phi, \tau, y$, (ii) $\psi \mid \phi, \tau, \theta$, (iii) $\tau \mid \phi, \theta$ and (iv) $\phi \mid \theta$. We use the fact that the joint posterior of
$(\psi, \phi, \tau)$ is conditionally independent of $y$ given $\theta$. Steps (ii) - (iv) together give us a draw from
the conditional distribution of $(\psi, \phi, \tau) \mid \theta$, since

$$[\psi, \phi, \tau \mid \theta] = [\psi \mid \phi, \tau, \theta]\,[\tau \mid \phi, \theta]\,[\phi \mid \theta].$$



Steps (i) - (iii) are standard and hence not derived. Step (iv) is non-trivial and we develop an
efficient sampling algorithm for jointly sampling $\phi$. Usual one-at-a-time updates of a Dirichlet
vector lead to tremendously slow mixing and convergence, and hence the joint update in Theorem
4.3 is an important feature of our proposed prior.

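Theorem 4.3. The joint posterior of $\phi \mid \theta$ has the same distribution as $(T_1/T, \ldots, T_n/T)$, where
$T_j$ are independently distributed according to a giG$(a - 1, 1, 2|\theta_j|)$ distribution, and $T = \sum_{j=1}^n T_j$.

Proof. We first state a result from the theory of normalized random measures (see, for example,
(36) in y/ZD). Suppose $T_1, \ldots, T_n$ are independent random variables with $T_j$ having a density $f_j$ on
$(0, \infty)$. Let $\phi_j = T_j/T$ with $T = \sum_{j=1}^n T_j$. Then, the joint density $f$ of $(\phi_1, \ldots, \phi_{n-1})$ supported
on the simplex $\mathcal{S}^{n-1}$ has the form

$$f(\phi_1, \ldots, \phi_{n-1}) = \int_{t=0}^{\infty} t^{n-1} \prod_{j=1}^{n} f_j(\phi_j t)\, dt,$$ (16)

where $\phi_n = 1 - \sum_{j=1}^{n-1} \phi_j$. Integrating out $\tau$, the joint posterior of $\phi \mid \theta$ has the form

$$\pi(\phi_1, \ldots, \phi_{n-1} \mid \theta) \propto \prod_{j=1}^{n} \Big[\phi_j^{a-1} \frac{1}{\phi_j}\Big] \int_{\tau=0}^{\infty} e^{-\tau/2}\, \tau^{\lambda - n - 1}\, e^{-\sum_{j=1}^{n} |\theta_j|/(\phi_j \tau)}\, d\tau.$$ (17)

Setting $f_j(x) \propto \frac{1}{x^{\delta}}\, e^{-|\theta_j|/x}\, e^{-x/2}$ in (16), we get

$$f(\phi_1, \ldots, \phi_{n-1}) = \Big[\prod_{j=1}^{n} \frac{1}{\phi_j^{\delta}}\Big] \int_{t=0}^{\infty} e^{-t/2}\, t^{n-1-n\delta}\, e^{-\sum_{j=1}^{n} |\theta_j|/(\phi_j t)}\, dt.$$ (18)

We aim to equate the expression in (18) with the expression in (17). Comparing the exponent of
$\phi_j$ gives $\delta = 2 - a$. The other requirement $n - 1 - n\delta = \lambda - n - 1$ is also satisfied, since
$\lambda = na$. The proof is completed by observing that $f_j$ corresponds to a giG$(a - 1, 1, 2|\theta_j|)$ when
$\delta = 2 - a$. □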

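A summary of each step is provided below.

(i) To sample $\theta \mid \psi, \phi, \tau, y$, draw $\theta_j$ independently from a $N(\mu_j, \sigma_j^2)$ distribution with

$$\sigma_j^2 = \{1 + 1/(\psi_j \phi_j^2 \tau^2)\}^{-1}, \quad \mu_j = \{1 + 1/(\psi_j \phi_j^2 \tau^2)\}^{-1} y_j.$$

(ii) The conditional posterior of $\psi \mid \phi, \tau, \theta$ can be sampled efficiently in a block by independently
sampling $\tilde\psi_j = 1/\psi_j$ given $\phi, \tau, \theta$ from an inverse-Gaussian distribution iG$(\mu_j, \lambda)$ with
$\mu_j = \phi_j \tau/|\theta_j|$, $\lambda = 1$, and setting $\psi_j = 1/\tilde\psi_j$. (The conditional density of $\psi_j$ is proportional to
$\psi_j^{-1/2} \exp[-\{\psi_j + \theta_j^2/(\phi_j^2 \tau^2 \psi_j)\}/2]$, so it is the reciprocal of $\psi_j$ that is inverse-Gaussian.)

(iii) Sample the conditional posterior of $\tau \mid \phi, \theta$ from a giG$(\lambda - n, 1, 2\sum_{j=1}^n |\theta_j|/\phi_j)$ distribution.

(iv) To sample $\phi \mid \theta$, draw $T_1, \ldots, T_n$ independently with $T_j \sim$ giG$(a - 1, 1, 2|\theta_j|)$ and set
$\phi_j = T_j/T$ with $T = \sum_{j=1}^n T_j$.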
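The four conditional draws can be sketched in code for the normal means model $y_j \sim N(\theta_j, 1)$. This is a minimal illustration rather than the authors' implementation. It assumes giG$(p, a, b)$ has density proportional to $x^{p-1} e^{-(ax + b/x)/2}$, draws $1/\psi_j$ (not $\psi_j$) from the inverse-Gaussian as implied by the conditional density, orders the draws as $\phi$, then $\tau$, then $\psi$ to follow the factorization $[\psi, \phi, \tau \mid \theta] = [\psi \mid \phi, \tau, \theta][\tau \mid \phi, \theta][\phi \mid \theta]$, and adds small numerical floors that are not part of the algorithm. The names `rgig` and `dl_gibbs` are hypothetical.

```python
import numpy as np
from scipy.stats import geninvgauss


def rgig(p, a, b, rng):
    """One draw from giG(p, a, b), density proportional to x^(p-1) exp(-(a*x + b/x)/2)."""
    # scipy's geninvgauss(p, c) has density proportional to x^(p-1) exp(-c*(x + 1/x)/2);
    # rescaling by sqrt(b/a) with c = sqrt(a*b) recovers the three-parameter form.
    draw = geninvgauss.rvs(p, np.sqrt(a * b), scale=np.sqrt(b / a), random_state=rng)
    return np.asarray(draw).item()


def dl_gibbs(y, a=0.5, n_iter=200, seed=0, eps=1e-8):
    """Gibbs sampler sketch for y_j ~ N(theta_j, 1) under a DL_a prior."""
    rng = np.random.default_rng(seed)
    n = y.size
    lam = n * a                                    # gamma(lam, 1/2) prior on tau
    psi, phi, tau = np.ones(n), np.full(n, 1.0 / n), 1.0
    draws = np.empty((n_iter, n))
    for it in range(n_iter):
        # (i) theta | psi, phi, tau, y: independent normals
        prior_var = np.clip(psi * phi**2 * tau**2, 1e-300, 1e300)
        post_var = 1.0 / (1.0 + 1.0 / prior_var)
        theta = rng.normal(post_var * y, np.sqrt(post_var))
        abs_theta = np.maximum(np.abs(theta), eps)   # numerical floor (assumption)
        # (iv) phi | theta: normalise independent giG(a - 1, 1, 2|theta_j|) draws
        T = np.array([rgig(a - 1.0, 1.0, 2.0 * t, rng) for t in abs_theta])
        phi = np.maximum(T / T.sum(), eps)           # numerical floor (assumption)
        phi /= phi.sum()
        # (iii) tau | phi, theta ~ giG(lam - n, 1, 2 * sum |theta_j| / phi_j)
        tau = rgig(lam - n, 1.0, 2.0 * np.sum(abs_theta / phi), rng)
        # (ii) 1/psi_j | phi, tau, theta is inverse-Gaussian, mean phi_j*tau/|theta_j|, shape 1
        psi = 1.0 / np.maximum(rng.wald(phi * tau / abs_theta, 1.0), 1e-300)
        draws[it] = theta
    return draws


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y = rng.standard_normal(50)
    y[:3] += 7.0                                   # three signals of size 7
    draws = dl_gibbs(y, a=0.5, n_iter=300)
    print(np.median(draws[100:], axis=0)[:5])      # posterior medians after burn-in
```

Drawing the giG variates one at a time keeps the sketch simple; a vectorised sampler would be preferable for large $n$.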

5. SIMULATION STUDY
Since the concentration results presented here are non-asymptotic in nature, we expect the theoretical
findings to be reflected in finite-sample performance. In particular, we aim to study whether the
improved concentration of the proposed Dirichlet Laplace ($\mathrm{DL}_{1/n}$) priors compared to the Bayesian
lasso (BL) translates empirically. As illustration, we show the results from a replicated simulation
study with various dimensionality $n$ and sparsity level $q_n$. In each setting, we have 100 replicates
of an $n$-dimensional vector $y$ sampled from a $N_n(\theta_0, I_n)$ distribution with $\theta_0$ having $q_n$ non-zero
entries which are all set to be a constant $A > 0$. We chose two values of $n$, namely $n = 100, 200$.
For each $n$, we let $q_n = 5, 10, 20\%$ of $n$ and choose $A = 7, 8$. This results in 12 simulation settings
in total. The simulations were designed to mimic the setting in Section 3 where $\theta_0$ is sparse with a
few moderate-sized coefficients.
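The data-generating design described above can be sketched as follows. Placing the non-zero entries in the first $q_n$ coordinates, the seed handling and the function names are assumptions made for illustration; the text only fixes $n$, $q_n$, $A$ and the number of replicates.

```python
import itertools
import numpy as np


def simulate_normal_means(n, sparsity_pct, A, n_rep=100, seed=0):
    """Return theta_0 and n_rep replicates of y ~ N_n(theta_0, I_n)."""
    rng = np.random.default_rng(seed)
    q_n = int(round(n * sparsity_pct / 100))
    theta0 = np.zeros(n)
    theta0[:q_n] = A                      # which coordinates are non-zero is an assumption
    y = theta0 + rng.standard_normal((n_rep, n))
    return theta0, y


def squared_error(estimate, theta0):
    """Squared error loss of a point estimate of theta_0."""
    return float(np.sum((estimate - theta0) ** 2))


# The 12 settings: n in {100, 200}, q_n in {5, 10, 20}% of n, A in {7, 8}.
settings = list(itertools.product((100, 200), (5, 10, 20), (7.0, 8.0)))

if __name__ == "__main__":
    for n, pct, A in settings:
        theta0, y = simulate_normal_means(n, pct, A)
        # Baseline: the raw observation as an estimator has expected squared error n.
        print(n, pct, A, np.mean([squared_error(row, theta0) for row in y]))
```

Any of the estimators compared in the study could replace the raw observation in the last loop.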

Table 1: Squared error comparison over 100 replicates

n             100     100     100     100     100     100     200     200     200     200     200     200
q_n (%)         5       5      10      10      20      20       5       5      10      10      20      20
A               7       8       7       8       7       8       7       8       7       8       7       8
BL          33.05   33.63   49.85   50.04   68.35   68.54   64.78   69.34   99.50  103.15   133l7    1303
DL_{1/n}     8.20    7.19   17.29   15.35   32.00   29.40   16.07   14.28   33.00   30.80   65.53   59.61
LS          21.25   19.09   38.68   37.25   68.97   69.05   41.82   41.18   75.55   75.12  137.21  136.25
EBMed       13.64   12.47   29.73   27.96   60.52   60.22   26.10   25.52   57.19   56.05  119.41  119.35
PM          12.15   10.98   25.99   24.59   51.36   50.98   22.99   22.26   49.42   48.42  101.54  101.62
HS           8.30    7.93   18.39   16.27   37.25   35.18   15.80   15.09   35.61   33.58   72.15   70.23



The squared error loss corresponding to the posterior median, averaged across simulation replicates,
is provided in Table 1. To offer further grounds for comparison, we have also tabulated the
results for Lasso (LS), Empirical Bayes median (EBMed) as in [15]^1, posterior median with a
point mass prior (PM) as in H and the posterior median corresponding to the horseshoe prior [5].
For the fully Bayesian analysis using point mass mixture priors, we use a complexity prior on the
subset-size, $\pi_n(s) \propto \exp\{-\kappa s \log(2n/s)\}$ with $\kappa = 0.1$ and independent standard Laplace priors
for the non-zero entries as in [6].^{2,3}

^1 The EBMed procedure was implemented using the package ua\ .

^2 Given a draw for $s$, a subset $S$ of size $s$ is drawn uniformly. Set $\theta_j = 0$ for all $j \notin S$ and draw $\theta_j$, $j \in S$, i.i.d.
from standard Laplace.

^3 The beta-Bernoulli priors in dl} induce a similar prior on the subset size.

Even in this succinct summary of the results, a wide difference between the Bayesian Lasso
and the proposed $\mathrm{DL}_{1/n}$ is observed in Table 1, vindicating our theoretical results. The horseshoe
performs similarly to the $\mathrm{DL}_{1/n}$. The superior performance of the $\mathrm{DL}_{1/n}$ prior can be attributed
to its strong concentration around the origin. However, in cases where there are several relatively
small signals, the $\mathrm{DL}_{1/n}$ prior can shrink all of them towards zero. In such settings, depending on
the practitioner's utility function, the singularity at zero can be "softened" using a $\mathrm{DL}_a$ prior for a
larger value of $a$. Based on empirical performance and computational efficiency, we recommend
$a = 1/2$ as a robust default choice. The computational gain arises from the fact that in this case,
the distribution of $T_j$ in (iv) turns out to be inverse-Gaussian (iG), for which exact samplers are
available.
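For $a = 1/2$ the draw in step (iv) can be sketched with an off-the-shelf inverse-Gaussian sampler. The sketch assumes the giG$(p, a, b)$ density is proportional to $x^{p-1}e^{-(ax + b/x)/2}$, under which giG$(-1/2, 1, b)$ coincides with the inverse-Gaussian law with mean $\sqrt{b}$ and shape $b$; the function name is hypothetical.

```python
import numpy as np


def sample_phi_half(theta, rng):
    """Draw phi | theta for a = 1/2 by normalising inverse-Gaussian variates."""
    b = 2.0 * np.abs(theta)            # T_j ~ giG(-1/2, 1, 2|theta_j|)
    T = rng.wald(np.sqrt(b), b)        # inverse-Gaussian: mean sqrt(b), shape b
    return T / T.sum(), T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = np.array([7.0, -0.3, 0.01, 2.5])
    phi, T = sample_phi_half(theta, rng)
    print(phi, phi.sum())
```

Coordinates with larger $|\theta_j|$ tend to receive a larger share $\phi_j$, since the mean of $T_j$ is $\sqrt{2|\theta_j|}$.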

Table 2: Squared error comparison over 100 replicates (n = 1000)

A               2       3       4       5       6       7
BL         299.30  385.68  424.09  450.20  474.28  493.03
HS         306.94  353.79  270.90  205.43  182.99  168.83
DL_{1/n}   368.45  679.17  671.34  374.01  213.66  160.14
DL_{1/2}   267.83  315.70  266.80  213.23  192.98  177.20



For illustration purposes, we choose a simulation setting akin to an example in [SI], where one
has a single observation $y$ from an $n = 1000$ dimensional $N_n(\theta_0, I_n)$ distribution, with $\theta_0[1:10] =
10$, $\theta_0[11:100] = A$, and $\theta_0[101:1000] = 0$. We then vary $A$ from 2 to 7 and summarize the
squared error averaged across 100 replicates in Table 2. We only compare the Bayesian shrinkage
priors here; the squared error for the posterior median is tabulated. Table 2 clearly illustrates that
prior elicitation in high dimensions should be guided by the goal at hand, namely shrinking the
noise versus detecting signals.



6. PROOFS OF CONCENTRATION RESULTS IN SECTION 3
In this section, we develop non-asymptotic bounds to the prior concentration which are subsequently
used to prove the posterior lower bound results. An important tool used throughout is a
general version of Anderson's lemma [30], providing a concentration result for multivariate Gaussian
distributions:

Lemma 6.1. Suppose $\theta \sim N_n(0, \Sigma)$ with $\Sigma$ p.d. and $\theta_0 \in \mathbb{R}^n$. Let $\|\theta_0\|_{\mathbb{H}}^2 = \theta_0^{\mathrm{T}} \Sigma^{-1} \theta_0$. Then, for
any $t > 0$,

$$e^{-\frac{1}{2}\|\theta_0\|_{\mathbb{H}}^2}\, P(\|\theta\|_2 \le t/2) \le P(\|\theta - \theta_0\|_2 < t) \le e^{-\frac{1}{2}\|\theta_0\|_{\mathbb{H}}^2}\, P(\|\theta\|_2 < t).$$

It is well known that among balls of fixed radius, a zero mean multivariate normal distribution
places the maximum mass on the ball centered at the origin. Lemma 6.1 provides a sharp bound
on the probability of shifted balls in terms of the centered probability and the size of the shift,
measured via the RKHS norm $\|\theta_0\|_{\mathbb{H}}^2$.

For GL shrinkage priors of the form (2), given $\psi = (\psi_1, \ldots, \psi_n)^{\mathrm{T}}$ and $\tau$, the elements of $\theta$ are
conditionally independent with $\theta \mid \psi, \tau \sim N_n(0, \Sigma)$ with $\Sigma = \mathrm{diag}(\psi_1 \tau, \ldots, \psi_n \tau)$. Hence we can
use Lemma 6.1 to obtain

$$e^{-\frac{1}{2\tau}\sum_{j=1}^n \theta_{0j}^2/\psi_j}\, P(\|\theta\|_2 < t_n/2 \mid \psi, \tau) \le P(\|\theta - \theta_0\|_2 < t_n \mid \psi, \tau) \le e^{-\frac{1}{2\tau}\sum_{j=1}^n \theta_{0j}^2/\psi_j}\, P(\|\theta\|_2 < t_n \mid \psi, \tau).$$ (19)

Letting $X_j = \theta_j^2$, the $X_j$'s are conditionally independent given $(\tau, \psi)$ with $X_j$ having a density
$f(x \mid \tau, \psi) = \{D/\sqrt{\tau \psi_j x}\}\, e^{-x/(2\tau\psi_j)}$ on $(0, \infty)$, where $D = 1/\sqrt{2\pi}$. Hence, with $w_n = t_n^2$,

$$P(\|\theta\|_2 \le t_n \mid \psi, \tau) = D^n \int_{\sum x_j \le w_n} \prod_{j=1}^{n} \frac{1}{\sqrt{x_j \tau \psi_j}}\, e^{-x_j/(2\tau\psi_j)}\, dx.$$ (20)

For the sake of brevity, we use $\{\sum x_j \le w_n\}$ in (20) and all future references to denote the region
$\{x : x_j \ge 0\ \forall j = 1, \ldots, n,\ \sum_{j=1}^n x_j \le w_n\}$. To estimate two-sided bounds for the marginal
concentration $P(\|\theta - \theta_0\|_2 \le t_n)$, we need to combine (19) & (20) and integrate out $\psi$ and $\tau$
carefully. We start by proving Theorem 3.2 & Theorem 3.3, where one only needs to integrate out $\tau$.
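6.1 Proof of Theorem 3.2

In (20), set $\psi_j = 1$ for all $j$, recall $D = 1/\sqrt{2\pi}$ and $w_n = t_n^2$, and integrate over $\tau$ to obtain

$$P(\|\theta\|_2 \le t_n) = D^n \int_{\tau=0}^{\infty} f(\tau) \Big[\int_{\sum x_j \le w_n} \prod_{j=1}^{n} \frac{1}{\sqrt{x_j \tau}}\, e^{-x_j/(2\tau)}\, dx\Big] d\tau.$$ (21)

Substituting $f(\tau) = c\,\tau^{-(1+\alpha)} e^{-\beta/\tau}$ with $c = \beta^{\alpha}/\Gamma(\alpha)$ and using Fubini's theorem to interchange
the order of integration between $x$ and $\tau$, (21) equals

$$c D^n \int_{\sum x_j \le w_n} \prod_{j=1}^{n} \frac{1}{\sqrt{x_j}} \Big[\int_{\tau=0}^{\infty} \tau^{-(1+n/2+\alpha)}\, e^{-\frac{1}{2\tau}(2\beta + \sum x_j)}\, d\tau\Big] dx$$
$$= c D^n 2^{n/2+\alpha}\, w_n^{n/2}\, \Gamma(n/2+\alpha) \int_{\sum x_j \le 1} \frac{1}{(2\beta + w_n \sum x_j)^{n/2+\alpha}} \prod_{j=1}^{n} \frac{1}{\sqrt{x_j}}\, dx.$$ (22)

We now state the Dirichlet integral formula (4.635 in UM) to simplify a class of integrals as above
over the simplex $\Delta^{n-1}$:

Lemma 6.2. Let $h(\cdot)$ be a Lebesgue integrable function and $\alpha_j > 0$, $j = 1, \ldots, n$. Then,

$$\int_{\sum x_j \le 1} h\Big(\sum_{j=1}^{n} x_j\Big) \prod_{j=1}^{n} x_j^{\alpha_j - 1}\, dx = \frac{\prod_{j=1}^{n} \Gamma(\alpha_j)}{\Gamma(\sum_{j=1}^{n} \alpha_j)} \int_{t=0}^{1} h(t)\, t^{(\sum_{j=1}^{n} \alpha_j) - 1}\, dt.$$

Lemma 6.2 follows simply by noting that the left hand side is $E\,h(\sum_{j=1}^n X_j)$ up to normalizing
constants where $(X_1, \ldots, X_n) \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_n, 1)$, so that $\sum_{j=1}^n X_j \sim \mathrm{Beta}(\sum \alpha_j, 1)$. Such
probabilistic intuitions will be used later to reduce more complicated integrals over a simplex to a
single integral on (0, 1).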
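The Dirichlet integral formula of Lemma 6.2 can be verified numerically in two dimensions. The sketch below assumes the formula in the form $\int_{\sum x_j \le 1} h(\sum x_j)\prod x_j^{\alpha_j-1}dx = \{\prod\Gamma(\alpha_j)/\Gamma(\sum\alpha_j)\}\int_0^1 h(t)\,t^{\sum\alpha_j-1}dt$; the exponents and the test function $h$ are arbitrary choices for illustration.

```python
from scipy.integrate import dblquad, quad
from scipy.special import gamma


def simplex_side(h, a1, a2):
    """Left-hand side: integral over {x, y >= 0, x + y <= 1}."""
    # dblquad integrates func(y, x) with x in [0, 1] and y in [0, 1 - x].
    value, _ = dblquad(lambda y, x: h(x + y) * x ** (a1 - 1.0) * y ** (a2 - 1.0),
                       0.0, 1.0, lambda x: 0.0, lambda x: 1.0 - x)
    return value


def one_dim_side(h, a1, a2):
    """Right-hand side: a single integral over (0, 1)."""
    s = a1 + a2
    value, _ = quad(lambda t: h(t) * t ** (s - 1.0), 0.0, 1.0)
    return gamma(a1) * gamma(a2) / gamma(s) * value


if __name__ == "__main__":
    h = lambda t: 1.0 / (1.0 + 2.0 * t) ** 3
    print(simplex_side(h, 1.5, 2.0), one_dim_side(h, 1.5, 2.0))
```

Exponents larger than one are used here only to keep the integrand bounded; the proofs apply the formula with $\alpha_j = 1/2$.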

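Lemma 6.2 with $h(t) = 1/(2\beta + w_n t)^{n/2+\alpha}$ applied to (22) implies

$$P(\|\theta\|_2 \le t_n) = c D^n 2^{n/2+\alpha}\, w_n^{n/2}\, \Gamma(n/2+\alpha)\, \frac{\Gamma(1/2)^n}{\Gamma(n/2)} \int_{t=0}^{1} \frac{t^{n/2-1}}{(2\beta + w_n t)^{n/2+\alpha}}\, dt.$$ (23)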



Substituting $D = 1/\sqrt{2\pi}$, bounding $(2\beta + w_n t)^{n/2+\alpha} \ge (2\beta)^{\alpha+1}(2\beta + w_n t)^{n/2-1}$, and letting
$\tilde{w}_n = w_n/(2\beta)$, (23) can be bounded above by



r(n/2 + a) 



r(n/2)r(a)(2/3) 



.0 (1 + - r(n/2)r(a)(2/3)-+i Vl + 



n/2-1 



where the second inequality above uses that $t/(a + t)$ is an increasing function in $t > 0$ for fixed
$a > 0$. By definition, $w_n = n^{\delta}$ for $0 < \delta < 1$ and hence the leading factor in the above display can be
bounded above by $e^{C_1 \log n}$. Also, using $(1 - x)^{1/x} \le e^{-1}$ for all $x \in (0, 1)$, $\{\tilde{w}_n/(1 + \tilde{w}_n)\}^{n/2-1}$ can be bounded
above by $e^{-C_2 n/w_n} = e^{-C_2 n^{1-\delta}}$. Hence the overall bound is $e^{-C n^{1-\delta}}$ for some appropriate constant
$C > 0$. □

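6.2 Proof of Theorem 3.3

We start with the upper bound in (7). The steps are similar to those above and hence only a sketch is
provided. Bounding $f(\tau) \le M$ and interchanging the order of integrals in (21),

$$P(\|\theta\|_2 \le t_n) \le M D^n 2^{n/2-1}\, \Gamma(n/2 - 1)\, w_n \int_{\sum x_j \le 1} \frac{1}{(\sum_{j=1}^n x_j)^{n/2-1}} \prod_{j=1}^{n} \frac{1}{\sqrt{x_j}}\, dx.$$ (24)

Invoking Lemma 6.2 with $h(t) = (1/t)^{n/2-1}$ in (24), the upper bound in (7) is proved:

$$P(\|\theta\|_2 \le t_n) \le (M/2)\, \frac{w_n}{n/2 - 1}.$$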



We turn towards proving the lower bound to the centered concentration in (7). Recalling that
$f(\tau) \ge 1/M$ on $(0, 1)$ for $f \in \mathcal{F}$, and interchanging integrals in (21), we have, with $K = 1/M$,

$$P(\|\theta\|_2 \le t_n) \ge K D^n \int_{\sum x_j \le w_n} \prod_{j=1}^{n} \frac{1}{\sqrt{x_j}} \Big[\int_{\tau=0}^{1} \tau^{-n/2}\, e^{-\sum x_j/(2\tau)}\, d\tau\Big] dx.$$ (25)

We state Lemma 6.3 to lower bound the inner integral over $\tau$; a proof can be found in the Appendix.
Recall $\int_{\tau=0}^{\infty} \tau^{-n/2} e^{-a_n/(2\tau)}\, d\tau = \Gamma(n/2 - 1)(2/a_n)^{n/2-1}$. Lemma 6.3 shows that the same integral
over $(0, 1)$ is of the same order when $a_n \lesssim n$.

Lemma 6.3. For a sequence $a_n \le n/(2e)$, $\int_{\tau=0}^{1} \tau^{-n/2} e^{-a_n/(2\tau)}\, d\tau \ge (2/a_n)^{n/2-1}\, \Gamma(n/2 - 1)\, \xi_n$,
where $\xi_n \uparrow 1$ with $(1 - \xi_n) \le D/\sqrt{n}$ for some constant $D > 0$.
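Lemma 6.3 lends itself to a quick numerical check. After the substitution $t = a_n/(2\tau)$, the ratio of the integral over $(0, 1)$ to the integral over $(0, \infty)$ is the regularised upper incomplete gamma function $\Gamma(n/2 - 1, a_n/2)/\Gamma(n/2 - 1)$. The sketch below compares this ratio with $\xi_n = 1 - \{n/(2e)\}^{n/2-1}/\Gamma(n/2)$, the choice made in the Appendix proof; the function names are hypothetical.

```python
import numpy as np
from scipy.special import gammaincc, gammaln


def truncated_to_full_ratio(n, a_n):
    """Ratio of the integral of tau^(-n/2) exp(-a_n/(2 tau)) over (0, 1) to that over (0, inf)."""
    return gammaincc(n / 2.0 - 1.0, a_n / 2.0)


def xi(n):
    """xi_n = 1 - {n/(2e)}^(n/2 - 1) / Gamma(n/2), computed on the log scale."""
    log_term = (n / 2.0 - 1.0) * np.log(n / (2.0 * np.e)) - gammaln(n / 2.0)
    return 1.0 - np.exp(log_term)


if __name__ == "__main__":
    for n in (4, 10, 50, 200, 1000):
        a_n = n / (2.0 * np.e)                     # largest a_n allowed by the lemma
        print(n, truncated_to_full_ratio(n, a_n), xi(n), (1.0 - xi(n)) * np.sqrt(n))
```

The last column stays bounded, in line with the claim $(1 - \xi_n) \le D/\sqrt{n}$.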



Clearly $\sum x_j \le w_n$ and hence we can apply Lemma 6.3 in (25) to get

$$P(\|\theta\|_2 \le t_n) \ge K \xi_n D^n 2^{n/2-1}\, \Gamma(n/2 - 1)\, w_n \int_{\sum x_j \le 1} \frac{1}{(\sum_{j=1}^n x_j)^{n/2-1}} \prod_{j=1}^{n} \frac{1}{\sqrt{x_j}}\, dx.$$ (26)

The rest of the proof proceeds exactly as in the upper bound case from (24) onwards. □

Finally, we combine Anderson's inequality (19) with (20) (with $\psi_j = 1$ for all $j$ in this case) to
bound the non-centered concentrations in (8). For the upper bound, we additionally use $f(\tau) \le M$
for all $\tau$ to obtain


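$$P(\|\theta - \theta_0\|_2 \le t_n) \le M D^n \int_{\sum x_j \le w_n} \prod_{j=1}^{n} \frac{1}{\sqrt{x_j}} \Big[\int_{\tau=0}^{\infty} \tau^{-n/2}\, e^{-[\|\theta_0\|_2^2 + \sum x_j]/(2\tau)}\, d\tau\Big] dx$$ (27)
$$= M D^n 2^{n/2-1}\, \Gamma(n/2 - 1)\, w_n^{n/2} \int_{\sum x_j \le 1} \frac{1}{(\|\theta_0\|_2^2 + w_n \sum_{j=1}^n x_j)^{n/2-1}} \prod_{j=1}^{n} \frac{1}{\sqrt{x_j}}\, dx$$ (28)
$$= M D^n 2^{n/2-1}\, \Gamma(n/2 - 1)\, w_n^{n/2}\, \frac{\Gamma(1/2)^n}{\Gamma(n/2)} \int_{x=0}^{1} \frac{x^{n/2-1}}{(\|\theta_0\|_2^2 + w_n x)^{n/2-1}}\, dx.$$ (29)

In the above display, (28) - (29) follows from applying Lemma 6.2 with $h(t) = 1/(\|\theta_0\|_2^2 +
w_n t)^{n/2-1}$. Simplifying constants in (29) as before and using that $t/(a + t)$ is an increasing function in
$t > 0$ for fixed $a > 0$, we complete the proof by bounding (29) above by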



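$$\frac{C w_n}{(n/2 - 1)} \int_{x=0}^{1} \frac{(w_n x)^{n/2-1}}{(\|\theta_0\|_2^2 + w_n x)^{n/2-1}}\, dx \le \frac{C w_n}{(n/2 - 1)} \Big(\frac{w_n}{w_n + \|\theta_0\|_2^2}\Big)^{n/2-1} \le \frac{C w_n}{(n/2 - 1)} \Big(\frac{w_n}{\|\theta_0\|_2^2}\Big)^{n/2-1}.$$

The right hand side of the above display can be bounded above by $e^{-cn \log n}$ for some constant
$c > 0$. Remark (3.4) readily follows from the above display; we didn't use the condition on $\|\theta_0\|_2$
so far.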

For the lower bound on the prior concentration in the non-centered case, we combine Anderson's
inequality (19) in the reverse direction along with (20). We then use the same trick as in the
centered case to restrict the integral over $\tau$ to $(0, 1)$ in (30). Note that the integral over the $x$'s is
over $\sum x_j \le v_n$ with $v_n = t_n^2/4$ as a consequence of (19). Hence,

$$P(\|\theta - \theta_0\|_2 \le t_n) \ge K D^n \int_{\sum x_j \le v_n} \prod_{j=1}^{n} \frac{1}{\sqrt{x_j}} \Big[\int_{\tau=0}^{1} \tau^{-n/2}\, e^{-[\|\theta_0\|_2^2 + \sum x_j]/(2\tau)}\, d\tau\Big] dx.$$ (30)

Noting that $\|\theta_0\|_2^2 + \sum x_j \le \|\theta_0\|_2^2 + v_n = o(n)$, we can invoke Lemma 6.3 to lower bound the
inner integral over $\tau$ by $\xi_n \Gamma(n/2 - 1)\, 2^{n/2-1}/(\|\theta_0\|_2^2 + \sum x_j)^{n/2-1}$ and proceed to obtain the same
expressions as in (28) & (29) with $M$ replaced by $K\xi_n$ and $w_n$ by $v_n$. The proof is then completed
by observing that the resulting lower bound can be further bounded below as follows:



CVn 



n/2-1 



dx > 



CVn 



{n/2 - 1) 7,=o dl^oll' + Vnxr/^-^ - {n/2 - 1) J,^,/, {\\eo\\l + v^x)^/' 



(VnX) 



n/2-1 



/2-1 



-dx 



> 



CVn 



fn/2 



ra/2-l 



> 



(n/2-1) V2 11^, 



n/2-1 



where the last inequality uses $t_n \le \|\theta_0\|_2$, so that $\|\theta_0\|_2^2 + v_n \le 2\|\theta_0\|_2^2$.



□ 



2 ''^ II''U|I2 I '^n^^iru|l2- 

To prove Theorem |331 we state and prove a more general result on concentration of GL priors 



Theorem 6.4. Assume $\theta \sim$ GL with $f \in \mathcal{F}$ and $g \equiv \mathrm{Exp}(\lambda)$ for some constant $\lambda > 0$. Also assume
$\theta_0$ has only one non-zero entry. Let $w_n = t_n^2$. Then, for a global constant $C_1 > 0$ depending only
on $M$ in the definition of $\mathcal{F}$,



m9-9oL<U<Ci 



(n-3)/2 



Vi=0 {^1 + \\9q\\^ / {TlWn)] 



(n-3)/2 



Let $v_n = r_n^2/4$ satisfy $v_n = O(\sqrt{n})$. Then, for $\|\theta_0\|_2 \ge 1/\sqrt{n}$,



9-9o\\^<rr,)>C2e-'''^ 



^l=ci||eoll^ {V'l + \\9o\\2/{'^Vn)} 



(n-3)/2^ "'/^l 



(31) 



(32) 



where $c_1, d_2, C_2$ are positive global constants with $c_1 \ge 2$ and $C_2$ depends only on $M$ in the
definition of $\mathcal{F}$.

6.3 Proof of Theorem 6.4

Without loss of generality, we assume $g$ to be the Exp(1) distribution since the rate parameter
$\lambda$ can be absorbed into the global parameter $\tau$ with the resulting distribution still in $\mathcal{F}$. Also,
assume the only non-zero entry in $\theta_0$ is $\theta_{01}$, so that $\|\theta_0\|_2^2 = |\theta_{01}|^2$. The steps of the proof follow
the same structure as in Theorem 3.3, i.e., using Anderson's inequality to bound the non-centered
concentration given $\psi, \tau$ by the centered concentration as in (19) and exploiting the properties of
$\mathcal{F}$ to ensure that the bounds are tight. A substantial additional complication arises in integrating
out $\psi$ in this case, requiring involved analysis.

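We start with the upper bound (31). Combining (19) & (20), and bounding $f(\tau) \le M$ yields:

$$P(\|\theta - \theta_0\|_2 \le t_n) \le M D^n \int_{\tau=0}^{\infty} \int_{\psi} \prod_{j=1}^{n} g(\psi_j) \Big[\int_{\sum x_j \le w_n} \prod_{j=1}^{n} \frac{1}{\sqrt{x_j \tau \psi_j}}\, e^{-[\|\theta_0\|_2^2/\psi_1 + \sum_{j=1}^n x_j/\psi_j]/(2\tau)}\, dx\Big] d\psi\, d\tau$$
$$= M D^n \int_{\sum x_j \le w_n} \prod_{j=1}^{n} \frac{1}{\sqrt{x_j}} \int_{\psi} \prod_{j=1}^{n} \frac{g(\psi_j)}{\sqrt{\psi_j}} \Big[\int_{\tau=0}^{\infty} \tau^{-n/2}\, e^{-[\|\theta_0\|_2^2/\psi_1 + \sum_{j=1}^n x_j/\psi_j]/(2\tau)}\, d\tau\Big] d\psi\, dx$$
$$= M D^n 2^{n/2-1}\, \Gamma(n/2 - 1)\, w_n^{n/2} \int_{\psi} \prod_{j=1}^{n} \frac{g(\psi_j)}{\sqrt{\psi_j}} \Big[\int_{\sum x_j \le 1} \frac{\prod_{j=1}^{n} x_j^{-1/2}}{[\|\theta_0\|_2^2/\psi_1 + w_n \sum_{j=1}^n x_j/\psi_j]^{n/2-1}}\, dx\Big] d\psi.$$ (33)

Compare (33) with (28). The crucial difference in this case is that the inner integral over the simplex
$\sum_{j=1}^n x_j \le 1$ is no longer a function of $\sum_{j=1}^n x_j$, rendering Lemma 6.2 inapplicable. An important
technical contribution of this paper in Lemma 6.5 below is that complicated multiple integrals over
the simplex as above can be reduced to a single integral over (0, 1):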



Lemma 6.5. Let $\alpha_j = 1/2$ for $j = 1, \ldots, n$ and $q_j$, $j = 0, 1, \ldots, n$ be positive numbers. Then,

$$\int_{\sum x_j \le 1} \frac{\prod_{j=1}^{n} x_j^{\alpha_j - 1}}{[\sum_{j=1}^{n} q_j x_j + q_0]^{n/2-1}}\, dx = \frac{\Gamma(1/2)^n}{\Gamma(n/2)}\, q_0 (n/2 - 1) \int_{x=0}^{1} \frac{x^{n/2-2}(1 - x)}{\prod_{j=1}^{n} (q_j x + q_0)^{\alpha_j}}\, dx.$$

A proof of Lemma 6.5 can be found in the Appendix. We didn't find any previous instance
of Lemma 6.5, though a related integral with $n/2$ in the exponent in the denominator appears in
[12]. Our technique for the proof, which utilizes a beautiful identity found in [17], can be easily
generalized to any $\alpha_j$ and other exponents in the denominator.
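The reduction in Lemma 6.5 can be sanity-checked by simulation. The sketch below assumes the identity in the form $\int_{\sum x_j \le 1}\prod x_j^{-1/2}\,[\sum q_j x_j + q_0]^{-(n/2-1)}dx = \{\Gamma(1/2)^n/\Gamma(n/2)\}\,q_0\,(n/2-1)\int_0^1 x^{n/2-2}(1-x)/\prod\sqrt{q_j x + q_0}\,dx$. The left side is written as an expectation under a $\mathrm{Dir}(1/2, \ldots, 1/2, 1)$ law and estimated by Monte Carlo; the values of $q_j$ are arbitrary and the function names are hypothetical.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln


def lhs_monte_carlo(q, q0, n_draws=200_000, seed=0):
    """Monte Carlo estimate of the simplex integral on the left-hand side."""
    n = len(q)
    rng = np.random.default_rng(seed)
    z = rng.dirichlet(np.r_[np.full(n, 0.5), 1.0], size=n_draws)[:, :n]
    vals = (z @ q + q0) ** (-(n / 2.0 - 1.0))
    # The Dir(1/2,...,1/2,1) density is Gamma(n/2 + 1) / Gamma(1/2)^n * prod x_j^(-1/2).
    log_const = n * gammaln(0.5) - gammaln(n / 2.0 + 1.0)
    return np.exp(log_const) * vals.mean()


def rhs_single_integral(q, q0):
    """Single integral over (0, 1) on the right-hand side."""
    n = len(q)
    integrand = lambda x: x ** (n / 2.0 - 2.0) * (1.0 - x) / np.sqrt(np.prod(q * x + q0))
    value, _ = quad(integrand, 0.0, 1.0)
    log_const = n * gammaln(0.5) - gammaln(n / 2.0)
    return np.exp(log_const) * q0 * (n / 2.0 - 1.0) * value


if __name__ == "__main__":
    q = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    print(lhs_monte_carlo(q, 1.5), rhs_single_integral(q, 1.5))
```

The two numbers should agree up to Monte Carlo error, which illustrates how an $n$-fold integral collapses to a one-dimensional one.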

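Applying Lemma 6.5 with $q_0 = \|\theta_0\|_2^2/\psi_1$ and $q_j = w_n/\psi_j$ to evaluate the inner integral over $x$,
(33) equals

$$(M\|\theta_0\|_2^2/2)\, w_n^{n/2} \int_{\psi} \prod_{j=1}^{n} \frac{g(\psi_j)}{\sqrt{\psi_j}}\, \frac{1}{\psi_1} \int_{x=0}^{1} \frac{x^{n/2-2}(1 - x)}{\prod_{j=1}^{n} \sqrt{w_n x/\psi_j + q_0}}\, dx\, d\psi,$$ (34)

noting that $(n/2 - 1) D^n 2^{n/2-1}\, \Gamma(n/2 - 1)\, \Gamma(1/2)^n/\Gamma(n/2) = 1/2$.

So, at this point, we are down from the initial $(2n + 1)$ integrals to $(n + 1)$ integrals. Next,
using $g(\psi_j) = e^{-\psi_j} 1(\psi_j > 0)$ to integrate out $\psi_j$, $j = 2, \ldots, n$, (34) equals

$$(M\|\theta_0\|_2^2/2)\, w_n^{n/2} \int_{\psi_1=0}^{\infty} \frac{e^{-\psi_1}}{\psi_1} \int_{x=0}^{1} \frac{x^{n/2-2}(1 - x)}{\sqrt{w_n x + \|\theta_0\|_2^2}} \Big\{\int_{\psi=0}^{\infty} \frac{e^{-\psi}}{\sqrt{w_n x + q_0 \psi}}\, d\psi\Big\}^{n-1} dx\, d\psi_1.$$ (35)

Using a standard identity and an upper bound for the complementary error function $\mathrm{erfc}(z) =
\frac{2}{\sqrt{\pi}} \int_{t=z}^{\infty} e^{-t^2} dt$ (see (A.7) in the Appendix),

$$\int_{\psi=0}^{\infty} \frac{e^{-\psi}}{\sqrt{w_n x + q_0 \psi}}\, d\psi = \sqrt{\frac{\pi}{q_0}}\, \exp(w_n x/q_0)\, \mathrm{erfc}\big(\sqrt{w_n x/q_0}\big) \le \frac{1}{\sqrt{w_n x + q_0/\pi}}.$$

Hence, the expression in (35) can be bounded above by

$$(M/2)\|\theta_0\|_2^2\, w_n^{n/2} \int_{\psi_1=0}^{\infty} \frac{e^{-\psi_1}}{\psi_1} \int_{x=0}^{1} \frac{x^{n/2-2}(1 - x)}{\sqrt{w_n x + \|\theta_0\|_2^2}\, [w_n x + q_0/\pi]^{(n-1)/2}}\, dx\, d\psi_1$$
$$= (M/2)\|\theta_0\|_2^2\, w_n^{n/2} \int_{\psi_1=0}^{\infty} e^{-\psi_1} \psi_1^{(n-3)/2} \int_{x=0}^{1} \frac{x^{n/2-2}(1 - x)}{\sqrt{w_n x + \|\theta_0\|_2^2}\, [w_n x \psi_1 + \|\theta_0\|_2^2/\pi]^{(n-1)/2}}\, dx\, d\psi_1.$$ (36)

Let us aim to bound the inner integral over $x$ in (36). We upper bound $(1 - x)$ in the numerator by 1,
lower-bound $\sqrt{w_n x + \|\theta_0\|_2^2}$ in the denominator by $\|\theta_0\|_2$ and multiply a $\sqrt{w_n x \psi_1 + \|\theta_0\|_2^2/\pi}$
term in the numerator and denominator to get

$$\int_{x=0}^{1} \frac{x^{n/2-2}(1 - x)}{\sqrt{w_n x + \|\theta_0\|_2^2}\, [w_n x \psi_1 + \|\theta_0\|_2^2/\pi]^{(n-1)/2}}\, dx \le \frac{\sqrt{w_n \psi_1 + \|\theta_0\|_2^2/\pi}}{\|\theta_0\|_2} \int_{x=0}^{1} \frac{x^{n/2-2}}{[w_n x \psi_1 + \|\theta_0\|_2^2/\pi]^{n/2}}\, dx.$$

We use the fact that $\int_{x=0}^{1} x^{n/2-2}/(\alpha x + \beta)^{n/2}\, dx = 2(\alpha + \beta)^{1-n/2}/\{\beta(n - 2)\}$ to conclude that the
last line in the above display equals

$$\frac{\sqrt{w_n \psi_1 + \|\theta_0\|_2^2/\pi}}{\|\theta_0\|_2} \cdot \frac{2\pi\, (w_n \psi_1 + \|\theta_0\|_2^2/\pi)^{1-n/2}}{\|\theta_0\|_2^2\, (n - 2)} = \frac{\pi\, (w_n \psi_1 + \|\theta_0\|_2^2/\pi)^{-(n-3)/2}}{\|\theta_0\|_2^3\, (n/2 - 1)}.$$

Substituting this in (36), we finally obtain:

$$P(\|\theta - \theta_0\|_2 \le t_n) \le \frac{C_1 w_n^{3/2}}{(n/2 - 1)\, \|\theta_0\|_2} \int_{\psi_1=0}^{\infty} \frac{\psi_1^{(n-3)/2}}{\{\psi_1 + \|\theta_0\|_2^2/(\pi w_n)\}^{(n-3)/2}}\, e^{-\psi_1}\, d\psi_1,$$ (37)

where $C_1 > 0$ is a global constant (depending only on $M$). (31) clearly follows from (37). □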
Lower bound: We proceed to obtain a lower bound to $P(\|\theta - \theta_0\|_2 \le r_n)$ similar to (37) under
additional assumptions on $r_n$ as in the statement of Theorem 6.4. To that end, note that in the proof
of the upper bound here, we used only two inequalities until (34): (i) Anderson's inequality in
(19) and (ii) upper bounding $f(\tau)$ by $M$. As in the proof of the lower bound in Theorem 3.3, we
obtain a lower bound similar to the expression in (34) by (i) using Anderson's inequality (19) in
the reverse direction, and (ii) using $f(\tau) \ge K$ on $(0, 1)$:



F(||^-^o|l2<^« 



d?/^ dx.0%) 
However, unlike Theorem 3.3, we cannot directly resort to Lemma 6.3 since $a_n = \|\theta_0\|_2^2/\psi_1 +
\sum_{j=1}^n x_j/\psi_j$ can be arbitrarily large if the $\psi_j$'s are close enough to zero. This necessitates a more
careful analysis in bounding below the expression in (38) by constraining the $\psi_j$'s to an appropriate
region $\Gamma$ away from zero:

$$\Gamma = \{c_1 \|\theta_0\|_2^2 \le \psi_1 \le c_2 \|\theta_0\|_2^2,\ \psi_j \ge c_3/\sqrt{n},\ j = 2, \ldots, n\}.$$

In the above display, $c_1 < c_2$ and $c_3 > 1$ are positive constants to be chosen later, that satisfy

$$1/c_1 + \max\{1/(c_1 \|\theta_0\|_2^2), \sqrt{n}/c_3\}\, v_n \le n/(2e).$$ (39)

With (39), we can invoke Lemma 6.3 to bound below the integral over $\tau$ in (38), since for $\psi \in \Gamma$,
$\|\theta_0\|_2^2/\psi_1 + \sum_{j=1}^n x_j/\psi_j \le 1/c_1 + \max\{1/(c_1 \|\theta_0\|_2^2), \sqrt{n}/c_3\} \sum_{j=1}^n x_j \le 1/c_1 + \max\{1/(c_1 \|\theta_0\|_2^2), \sqrt{n}/c_3\}\, v_n \le
n/(2e)$ by (39). The resulting lower bound is exactly the same as (33) with $M$ replaced by $K\xi_n$ and
$w_n$ by $v_n$, where $\xi_n \uparrow 1$ is as in Lemma 6.3. As in the upper bound calculations (33) - (34), we
invoke Lemma 6.5 with $q_0 = \|\theta_0\|_2^2/\psi_1$ and $q_j = v_n/\psi_j$ to reduce the multiple integral over the
simplex and bound the expression in (38) below by



3/4 a;n/2-2Q_3,^ 

dxdip 



(A^||^o|l2/2)a</' 



JV'i=ci||(?o||' ^1 A=i/2 V^^na; + go^i Uv=c3/v^ V + m J 

(40) 

Note the inner integral over x is restricted to (1/2, 3/4). Now, 

r , , dj, = ^e^-^Z-^^erfc {jvnx/q^ + c^/V^ . (41) 

We use a lower bound on the erfc function (see lA.Sl in the Appendix for a proof) which states that 
for z > 2, 0Fe^erfc(v^) > (l/v^) for any 6 > 0. Since we have restricted x > 1/2 in 
(|40|) and fnV'i/ 116*0112 — '^i^n '^^ have \/vnxlqQ + c-il\fn > ^Jc{ provided Vn > 1. Thus,
choosing c\ > 2, we can apply the above lower bound on the error function to bound the expression 
in the r.h.s. of (|4TI) as: 



1 



l+<5 



v/^;„,a;/go + 3/(47r) (1 + cs)-^ (1 + cs)-^ ^t;„x + 3go/(47r) ' 



In the second to third step, we used that f„x/go + Czl \fn < Vnx/qo + 3/(47r) for n larger than 
some constant. We choose 6 = l/{n — 1) and substitute the above lower bound for the l.h.s. of 
(HTT) into (gOl). This allows us to bound (gO]) below by 



C2\\9of , /•3/4 

^-^,^(n-3)/2 

?/'l=ci||6»o|P Jx=l 



(n-l)/2 



rfa;#i. (42) 



Let us tackle the integral over x in (l42l) . To that end, we first lower-bound (1 — x) in the numerator 
by 1/4, upper-bound \/ VnX + || 6^0 1| 2 in the denominator by \/f „ + || ^0 1| 2- Next, we use the formula 



3/4 



x=l/2 



X 



n/2-2 



[ax + ^yi' 



-dx 



2(a + 4/3/3)^-"/^ 
Pin - 2) 



4/3/3 I 
f2/3 j 



with a = Vnipi and /3 = 3 H^olla /(4vr). Now, (a + 4/3/3)/(a + 2/3) = 1 - 2/3/{3(a + 2/3)} is 

1 1 2 1 1 2 

bounded away from and 1 since ci || 6*011 2 < a < C2 ||6'o||2- Thus, 



a -f 4/3/3 1"/'-'^ 



a 



2/3 j 



> 1/2 
for $n$ large. Substituting all these in (42), we finally obtain:

(43) 

where C2 > is a global constant depending only on K in the definition of T and ^„ f 1 with 
1 — ^„ < D/ for some constant D > 0. We only required ci > 2 so far. Since 1 1 ^0 1 1 2 — V ^A^' 
choosing ci and C3 to be sufficiently large constants, (|39| ) can always be satisfied. The proof of (|32] ) 



clearly follows from (|43T ). since {„t'„/ (n/2 — 1) / 2 can be bounded below by e □ 
6.4 Proof of Theorem |3J] 

Let m„ = (n — 3)/2. We set t„ = s„, where s„ is the minimax rate corresponding to g„ = 1, so 
that Wn = sl^ = logn. Also, let ||6'o||2 = ttw^m^, where m„ is a slowly increasing sequence; we 
set Un = log(logn) for future references. Finally let Vn = = y/m^. With these choices, we 
proceed to show that (fTTT) holds. 

We first simplify (|37T ) further. The function x — )■ x/x{x + a) monotonically increases from 
to 1 for any a > 0. Thus, for any r„ > 0, 

7 ^ r^e-'^i#i 

V,i=0 {V'l + ll^o|l2/(^^n)} 

-y^,=o{Vi + ii^oii2/(™n)} " 4=T„ v^n + <; 

We choose an appropriate T„ which gives us the necessary bound, namely Tn = Uny/rn^- Then, 
using the fact that (1 — xY^^ < e^^ for all x E (0, 1), we have 



" j ^ g-m„u„/(^m„+u„) ^ ^-u„^mn/2 



where for the last part used that e~^/^ is an increasing function and + m.„ < 2Jm^. Thus, 
substituting $T_n$ in (44) yields, for a global constant $C_1 > 0$,



(^-U2<sn)< , J-^ e--v^/^ (45) 



in/2-1) V ll^oL 

Next, again using the fact that x — x/x{x + a) is monotonically increasing, and choosing C2 = oo, 
we simplify the lower bound (|43T) . Observe 

^1 



i,^=ci\\9of {i^l + 116*0112 /(VTW^)}"'" 



for some constant C > 0. Finally, using (1 — x)^/^ > e ^ for all x E (0, 1/2) and e is an 
increasing function in x > 0, we have. 



Hence, the integral is bounded below by e~^^^~^^^^^^°^^^^^'^ , resulting in 



(^/2-l) V^n+ll^olls 



Thus, finally, noting that m.„ — )■ oo. 



3 /2 

F(||^ - ^nll2 < s„) ^ ^ ^c(,/y^+v/;?+||eo||^) g-«„v^/2 ^ g 



- 6*0112 <r„) v„ 
where C, > are constants. □ 
6.5 Proof of Theorem |3j

As before, we assume A = 1 w.l.g., since it can be absorbed in the constant appearing the sequence 
r„ otherwise. As in the proof of Theorem 13.71 combine (fT9] ) & (|20] ) to obtain 

P(||^-^o|| <tn) < I^X^"/' / [ f[^—exp( ~^^^^)dx\d^|J 

where Wn = f^- Using the fact exp { — (f + a;) }dx = -/Tre^^^, we obtain 

P(||^-^0|| <tn) 

-V T„ j jj.^^^^n^''^ I r„ j ^ r(n/2+l)' 

where the second to third inequality uses Xj > and the last integral follows from Lemma 16.21 
Along the same lines, 




where u„ = From the second to third equation in the above display, we used \/a + b < 

-y/a + y/b and < ^Jn by Cauchy-Schwartz inequality if x G A"~^. Thus, from (1471) & 

(l48l) . the ratio in (fTTT ) can be bounded above as: 

P(||^-^0|| <rn)-\Vn) 
Choose tn = Sn, Tn = 2\/2s„ SO that Vn = 2wn = 2g„log(?7,/g„) and {wn/vn)'"'^'^ = e~*^". Clearly
Vn^/Tn < Cnqni\ognY and hence, eV^^""-/"^" = o(e'"") by assumption. Thus, the right hand side 
of the above display — t- 0, proving the assertion of the Theorem. 



Proof of Theorem lO 

First, we will state a more general result on the concentration of DLi /„ (t) when r ~ Exp(A). The 
result follows from a straightforward modification of Lemma 4. 1 in [|22|l and the detailed proof is 
omitted here. Assume 5 = tn/ {2n). For fixed numbers < a < 6 < 1, let 



(n - qn)S (g„ - 1)6 (g„ - l)a 

Vn = I - ^ — , U = 1 



2g„log(r2/g„ 



2gn 



4g„ 



Also, without loss of generality assume that {1} C = supp(^o)j i-^-, 6*01 7^ 0. Let Si = So\{l}. 
If 9o G loiqn', n), it follows that 

mi^-^oll <tn) >CP(r G [2g„,4g„])A„E„, 



where C is an absolute constant, and 



An = exp <^ - qn log 2 - ^ 



1 — exp 



'2qJ)] _ 



a 



1 — exp < — 



/2^(1 - a/8)e„ 



In our case, |S'o| = l,9oi = -y/Iogn, and t„ = n^^'^. Hence An is a constant, i?„ = exp{—Kiy/[ogn} 
for some constant Ki and P(r G [2g„, 4g„]) is also a constant. Hence, under the assumptions of 
TheoremgJl Pi\\0 - Oo\\ < t„) > exp{-Cy/\og^}. 
APPENDIX

Proof of Proposition 4.2

When $a = 1/n$, $\phi_j \sim \mathrm{Beta}(1/n, 1 - 1/n)$ marginally. Hence, the marginal distribution of $\theta_j$ given
$\tau$ is proportional to



1 / A \ 



Substituting z = (1 — so that (jyj = z / [1 + z), the above integral reduces to 

/•oo 



2=0 



In the general case, (pj ~ Beta(a, {n — l)a) marginally. Substituting z = — (pj) as before, 

the marginal density of 9j is proportional to 



poo / 1 \ ""^"1 



z=0 



1 + Z 



The above integral can clearly be bounded below by a constant multiple of 



Jz=0 



Resort to Lemma [63] to finish the proof. 

Proof of Lemma 6.3

Using a simple change of variable,

$$\int_{\tau=0}^{1} \tau^{-n/2} e^{-a_n/(2\tau)}\, d\tau = \int_{z=1}^{\infty} z^{n/2-2} e^{-a_n z/2}\, dz = \Big(\frac{2}{a_n}\Big)^{n/2-1} \int_{t=a_n/2}^{\infty} t^{n/2-2} e^{-t}\, dt = \Big(\frac{2}{a_n}\Big)^{n/2-1} \Big[\Gamma(n/2 - 1) - \int_{t=0}^{a_n/2} t^{n/2-2} e^{-t}\, dt\Big].$$

Noting that $\int_{t=0}^{a_n/2} t^{n/2-2} e^{-t}\, dt \le a_n^{n/2-1}/(n/2 - 1)$ and $a_n \le n/(2e)$ by assumption, the last entry
in the above display can be bounded below by

$$\Big(\frac{2}{a_n}\Big)^{n/2-1} \Gamma(n/2 - 1) \Big[1 - \frac{a_n^{n/2-1}}{\Gamma(n/2)}\Big] \ge \Big(\frac{2}{a_n}\Big)^{n/2-1} \Gamma(n/2 - 1) \Big[1 - \frac{\{n/(2e)\}^{n/2-1}}{\Gamma(n/2)}\Big].$$

Let $\xi_n = 1 - \{n/(2e)\}^{n/2-1}/\Gamma(n/2)$. Using the fact that $\Gamma(m) \ge \sqrt{2\pi}\, m^{m-1/2} e^{-m}$ for any
$m > 0$, one has $\Gamma(n/2) \ge C \{n/(2e)\}^{n/2-1} \sqrt{n}$ with $C = e^{-1}\sqrt{\pi}$. Hence, $(1 - \xi_n) \le C/\sqrt{n}$ for
some absolute constant $C > 0$.



Proof of Lemma 16.51 

Let s = n/2, T = X;j=i QjXj + Qo and q'j 
equals 



ilj + Qo)- Then, the multiple integral in Lemma |63] 



1 

—-^Y[x;'-'dx 



1 ^ /g'xA"^-' A /T^"^-' 



It 



n 



,=1 



^1 nJ-J 



T 



L 



n 



(A.l) 



Now, we make a change of variable from x to 2;, with zj = q'-Xj /T fox j = 1 . . . ,n. Clearly, z also 
belongs to the simplex A'^""^\ Moreover, letting Zn+i = 1 — J2]=i one has Zn+i = qoXn+i/T, 
where Xn+i = 1 — Z]J=i ^i- Thus, by composition rule. 



T = — 

I'l 



Xr 



^n+l 



Zn/qn Zn/qo ^iM H h + ^^n+l /^O 



Let J = (^)j; be the Jacobian of the transformation and H = {^) ji = J ^- Then, 



(A.2) 



H. 



T2 



(T-g,X,)if/ = j 
Clearly, \H\ = \Hi\Y[^^ijk with Hi = Tin — xq^ , where q = (gi, . . . , g^)"^ and \A\ de-
notes the determinant of a square matrix A. Using a standard result for determinants of rank 
one perturbations, one has \Hi\ = T" |l„ - = T"(l - ^) = goT""\ implying \H\ = 

[qoT"-'^) YYj=i ^ = T^TT 11^=1 1j- Hence the Jacobian of the transformation is 



1^1 



nn I 1 



SO that the change of variable in (lA.ll ) results in 



n 



n 



T^dz 



n 



,.1 



1 \ 



E"=i ^.<l ('^l^l ^ '^"^^ + Zn+iy I n ''^ 



(A.3) 



where v 



go 



^. Now, the expression in (|A.3I) clearly equals 



E< Z/iZi H h I^n^n + 



n 



(A.4) 



where (Zi, . . . , Z„,) ~ Dir(ai, . . . , a„, 1). A profound result in [7] shows that expectations of 
functions of Dirichlet random vectors as above can be reduced to the expectation of a functional of 
univariate Beta random variable: 

Dickey's formula BTD: Let {Zi,--- ,Z„) ~ Dir(/3i,--- ,Pn,Pn+i) and Z^+i = 1 - EILi 
Suppose a < Yll=i t^j- Then, for uj > 0, 



r n+1 



E 



»-t+i 



En 



where X ~ Beta(6, a) with h = Yl]j=i l^j ~ 
Applying Dickey's formula with Pj = aj = 1/2 for j = 1, . . . ,n, Pn+i = 1 and a = 2 (so that
6=f + l- 2 = f-l), ^KM reduces to 

where X ~ Beta(6, a) with density f{x) = {n/2){n/2 — for X G (0, 1). Hence, 

(IA.5I) finally reduces to 



2 ; r(n/2) 7o n;=ife^+go) 

□ 



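Error function bounds

Let $\mathrm{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_{t=x}^{\infty} e^{-t^2} dt$ denote the complementary error function; clearly $\mathrm{erfc}(x) = 2[1 -
\Phi(\sqrt{2}x)]$, where $\Phi$ denotes the standard normal c.d.f. A standard inequality (see, for example,
Formula 7.1.13 in [1]) states

$$\frac{2}{\sqrt{x} + \sqrt{x + 2}} \le \sqrt{\pi}\, e^{x}\, \mathrm{erfc}(\sqrt{x}) \le \frac{2}{\sqrt{x} + \sqrt{x + 4/\pi}}.$$ (A.6)

Based on (A.6), we show that

$$\sqrt{\pi}\, e^{x}\, \mathrm{erfc}(\sqrt{x}) \le \frac{1}{\sqrt{x + 1/\pi}},$$ (A.7)

$$\sqrt{\pi}\, e^{x}\, \mathrm{erfc}(\sqrt{x}) \ge \Big(\frac{1}{\sqrt{x}}\Big)^{1+\delta},$$ (A.8)

where (A.8) holds for any $\delta > 0$ provided $x \ge 2$.

In view of (A.6), to prove (A.7) it is enough to show that $2\sqrt{x + 1/\pi} \le \sqrt{x} + \sqrt{x + 4/\pi}$,
which follows since:

$$(\sqrt{x} + \sqrt{x + 4/\pi})^2 - 4(x + 1/\pi) = 2\sqrt{x}\sqrt{x + 4/\pi} - 2x \ge 0.$$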
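The two-sided bound (A.6) and the cruder upper bound (A.7) can be checked on a grid. The sketch assumes the bounds in the form $2/(\sqrt{x} + \sqrt{x+2}) \le \sqrt{\pi}\,e^{x}\mathrm{erfc}(\sqrt{x}) \le 2/(\sqrt{x} + \sqrt{x + 4/\pi}) \le 1/\sqrt{x + 1/\pi}$ and uses the scaled function `erfcx`, which evaluates $e^{z^2}\mathrm{erfc}(z)$ without overflow; the helper name is hypothetical.

```python
import numpy as np
from scipy.special import erfcx


def erfc_bounds(x):
    """Return (lower, middle, upper, a7) for sqrt(pi) * exp(x) * erfc(sqrt(x))."""
    x = np.asarray(x, dtype=float)
    middle = np.sqrt(np.pi) * erfcx(np.sqrt(x))              # erfcx(z) = exp(z^2) erfc(z)
    lower = 2.0 / (np.sqrt(x) + np.sqrt(x + 2.0))            # left side of (A.6)
    upper = 2.0 / (np.sqrt(x) + np.sqrt(x + 4.0 / np.pi))    # right side of (A.6)
    a7 = 1.0 / np.sqrt(x + 1.0 / np.pi)                      # bound (A.7)
    return lower, middle, upper, a7


if __name__ == "__main__":
    grid = np.linspace(0.01, 50.0, 2000)
    lower, middle, upper, a7 = erfc_bounds(grid)
    print(np.all(lower <= middle), np.all(middle <= upper), np.all(upper <= a7))
```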
To show (A.8), we use the lower bound for the complementary error function in (A.6). First, we
will show that for any 5 > 0, x + \/x^ + 2 < 2x^+^ if x > 2. Noting that if x > 2 



Hence -\/a?~+~2 < x^^*^ if x > 2, showing that x + Va;^ + 2 < 2x^+'' if x > 2. Thus, we have, for 
X > 2 and any 5 > 0, 



v^e"erfc(v^) > (^-^^ 



1+5 



□ 